
Most data-driven scientific inference, and qualitative, quantitative, and visual analytics, involve formulating, understanding the behavior of, and optimizing objective (cost) functions. Presenting the mathematical foundations for representing and interrogating a diverse spectrum of objective functions provides mechanisms for obtaining effective solutions to complex big data problems. (Multivariate) function optimization (minimization or maximization) is the process of searching for variables \(x_1, x_2, x_3, \ldots, x_n\) that either minimize or maximize the multivariate cost (objective) function \(f(x_1, x_2, x_3, \ldots, x_n)\). In this chapter, we will specifically discuss (1) constrained and unconstrained optimization, (2) Lagrange multipliers, (3) linear, quadratic and (general) non-linear programming, and (4) data denoising.

1 General optimization approach

In its most general framework, continuous optimization involves iteratively traversing the domain and assessing the change of the objective function. The process starts by specifying initial conditions, or randomly choosing a starting point in the domain, along with the traversal pattern and the update rule for the cost-function estimates. The last part is computed step-wise using a fixed update mechanism that yields the next domain point. The updating function typically involves the relative change of the objective function, e.g., gradients computed at past and current locations. Gradient descent optimization relies on updates computed using the negative gradient, whereas momentum-based optimization, an alternative to stochastic gradient descent, updates the points using a scaled exponential moving average of the gradients. The main differences between alternative optimization algorithms are the objective function and the update protocol. The pseudocode below defines the general algorithmic framework for unconstrained continuous optimization. Following the (random or seeded) initialization of the algorithm (\(x_o\)) in the domain of the objective function, we traverse the domain by iteratively updating the current location (\(x_i\)) step-by-step using a predefined learning rate, or step size, (\(\gamma\)), a momentum decay factor (\(\alpha\)), and a functor (\(\phi\)) of the objective function (\(f\)) and its gradient (\(\nabla f\)), as well as the current location (\(x_{i-1}\)) and all past locations (\(\{x_j\}_{j=o}^{i-1}\)):

\[\begin{array}{rcl} \textbf{Generic} & \textbf{Pseudo Optimization} & \textbf{Algorithm} \\ Initialization: & \text{Objective function: } f, & \text{random (seeded) point in domain: } x_o \\ Iterator: & \textbf{for } i=1, 2, 3,... & \textbf{do} \\ & \Delta x = & \phi\left ( \{x_j, f(x_j), \nabla f(x_j) \}_{j=0}^{i-1}\right )= \begin{cases} \text{gradient descent:} & \phi(.)=-\gamma\nabla f(x_{i-1}) \\ \text{stochastic gradient descent:} & \phi(.)=-\gamma\nabla f(x_{i-1})+\xi (i-1), \ \xi (i-1) \text{ is stochastic noise} \\ \text{momentum method:} & \phi(.)=-\gamma \left ( \sum_{j=0}^{i-1}{\alpha^{(i-j-1)} \nabla f(x_{j})} \right ) \\ \text{neural net ML:} & \phi(.): \ W_{i,j}^{layer}=W_{i,j}^{layer}-\gamma \frac{\partial J}{\partial W_{i,j}^{layer}},\ J=\text{NN error},\ W_{i,j}^{layer}=\text{weight coefficients} \end{cases} \\ & \text{If stopping criterion is met,} & \text{then return location: } x_{i-1} \\ & \text{Otherwise,} & \text{update the location: } x_i = x_{i-1}+\Delta x \\ & \text{End} & \textbf{for } \text{ loop} \end{array}\]
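As a concrete illustration of this generic scheme, the short R sketch below (our own minimal example, not part of the formal framework) minimizes \(f(x_1,x_2)=(x_1-3)^2+(x_2+4)^2\) by plain gradient descent with a fixed learning rate \(\gamma\) and a simple small-update stopping criterion:

```r
# Minimal gradient-descent sketch of the generic algorithm above
f      <- function(x) (x[1] - 3)^2 + (x[2] + 4)^2    # objective function
grad_f <- function(x) c(2*(x[1] - 3), 2*(x[2] + 4))  # its gradient

gamma <- 0.1          # learning rate (step size)
x     <- c(0, 0)      # x_0: initial point
for (i in 1:1000) {
  dx <- -gamma * grad_f(x)           # phi(.) for plain gradient descent
  if (sqrt(sum(dx^2)) < 1e-8) break  # stopping criterion: negligible update
  x <- x + dx                        # x_i = x_{i-1} + delta x
}
x   # converges near the true minimizer (3, -4)
```

With \(\gamma=0.1\) the distance to the minimizer contracts by a factor of \(0.8\) per step, so convergence is rapid for this quadratic objective.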

Various performance metrics may be used to drive the learning in the optimization process. Such loss metrics reward good optimizers and penalize bad ones. When minimizing an objective function, a loss representing the sum of the objective values over all iterations, i.e., the cumulative regret, favors optimizers that converge rapidly.
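For instance, assuming the minimum value \(f^*\) is known, the cumulative regret of a gradient-descent run can be tracked explicitly; below is a toy illustration of our own on \(f(x)=x^2\), where \(f^*=0\):

```r
# Toy illustration: cumulative regret of gradient descent on f(x) = x^2,
# whose known minimum value is f* = 0
f <- function(x) x^2
x <- 3; gamma <- 0.2
regret <- 0
for (i in 1:50) {
  regret <- regret + (f(x) - 0)   # accumulate the excess loss f(x_i) - f*
  x <- x - gamma * 2 * x          # gradient-descent update
}
regret  # a small cumulative regret reflects rapid convergence
```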

2 Free (unconstrained) optimization

Unconstrained function optimization refers to searching for extrema without restrictions on the domain of the cost function, \(\Omega \ni \{x_i\}\). The extreme value theorem suggests that a solution to the free optimization process, \(\min_{x_1, x_2, x_3, \ldots, x_n}{f(x_1, x_2, x_3, \ldots, x_n)}\) or \(\max_{x_1, x_2, x_3, \ldots, x_n}{f(x_1, x_2, x_3, \ldots, x_n)}\), may be obtained by a gradient vector descent method. This means that we can minimize/maximize the objective function by finding solutions to \(\nabla f = \left \{\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\right \}=\{0, 0, \ldots, 0\}\). Solutions to this equation, \(x_1, \ldots, x_n\), represent candidate (local) minima and maxima.

In general, identifying critical points using the gradient or tangent plane, where the partial derivatives vanish, may not be sufficient to determine the extrema (minima or maxima) of multivariate objective functions. Some critical points may represent inflection points, or local extrema that are far from the global optimum of the objective function. The eigenvalues of the Hessian matrix, which contains the second-order partial derivatives, evaluated at the critical points provide clues for pinpointing extrema. For instance, an invertible Hessian matrix that (1) is positive definite (i.e., all eigenvalues are positive) yields a local minimum at the critical point, (2) is negative definite (all eigenvalues are negative) suggests that the objective function has a local maximum at the critical point, and (3) has both positive and negative eigenvalues yields a saddle point for the objective function at the critical point where the gradient is trivial.
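This second-derivative test can be automated numerically. The sketch below (assuming the numDeriv package is available) classifies two critical points at the origin by the signs of the Hessian eigenvalues:

```r
# Classify critical points by the eigenvalues of the (numerical) Hessian
# install.packages("numDeriv")
library(numDeriv)

f_min    <- function(x) x[1]^2 + x[2]^2      # critical point at (0, 0)
f_saddle <- function(x) x[1]^2 - x[2]^2      # critical point at (0, 0)

eigen(hessian(f_min,    c(0, 0)))$values  # all positive  -> local minimum
eigen(hessian(f_saddle, c(0, 0)))$values  # mixed signs   -> saddle point
```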

There are two complementary strategies to avoid being trapped in local extrema. First, we can run many iterations with different initial vectors. Each iteration may reach a (local) maximum/minimum/saddle point, and at the end we select the overall minimal (or maximal) value across all iterations. Another, adaptive, strategy involves either adjusting the step sizes or accepting solutions with some probability; simulated annealing is one example of adaptive optimization.
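A minimal multi-start sketch in R (the multimodal objective below is our own hypothetical example) runs optim() from several random initial points and keeps the best result:

```r
# Multi-start strategy: optimize from many random seeds, keep the best run
f_multi <- function(x) (x^2 - 4)^2/10 + sin(3*x)  # hypothetical multimodal objective
set.seed(1234)
starts <- runif(20, -3, 3)                        # 20 random initial points
runs   <- lapply(starts, function(s) optim(s, f_multi, method = "BFGS"))
best   <- runs[[which.min(sapply(runs, `[[`, "value"))]]
best$par; best$value   # overall best result across all restarts
```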

2.1 Example 1: minimizing a univariate function (inverse-CDF)

The cumulative distribution function (CDF) of a real-valued random process \(X\), also known as the distribution function of \(X\), represents the probability that the random variable \(X\) does not exceed a certain level. Mathematically speaking, the CDF of \(X\) is \(F_X(x) = P(X\leq x)\). Recall the Chapter 1 discussions of Uniform, Normal, Cauchy, Binomial, Poisson and other discrete and continuous distributions. Try the Probability Distributome Navigator. Also explore the dynamic representations of density and distribution functions included in the Probability Distributome Calculators.

For each \(p\in [0,1]\), the inverse distribution function, also called the quantile function (e.g., qnorm), yields the value (\(x\)) at which the probability that the random variable does not exceed \(x\) equals the given probability (\(p\)). When the CDF \(F_X\) is continuous and strictly increasing, the value of the inverse CDF at \(p\), \(F^{-1}(p)=x\), is the unique real number \(x\) such that \(F(x)=p\).
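For example, R's built-in standard normal distribution functions satisfy this inverse relation directly:

```r
# F^{-1}(F(x)) = x and F(F^{-1}(p)) = p for the standard normal N(0,1)
qnorm(pnorm(1.3))    # recovers 1.3
pnorm(qnorm(0.95))   # recovers 0.95
qnorm(0.5)           # the median of N(0,1) is 0
```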

Below, we will plot the probability density function (PDF) and the CDF for Normal distribution in R.

par(mfrow=c(1,2), mar=c(3,4,4,2))

z<-seq(-4, 4, 0.1)  # points from -4 to 4 in 0.1 steps
q<-seq(0.001, 0.999, 0.001)  # probability quantile values from 0.1% to 99.9% in 0.1% steps

dStandardNormal <- data.frame(Z=z, Density=dnorm(z, mean=0, sd=1), Distribution=pnorm(z, mean=0, sd=1)) 

plot(z, dStandardNormal$Density, col="darkblue",xlab="z", ylab="Density", type="l",lwd=2, cex=2, main="Standard Normal PDF", cex.axis=0.8)
# could also do
# xseq<-seq(-4, 4, 0.01); density<-dnorm(xseq, 0, 1); plot (density, main="Density")

# Compute the CDF
xseq<-seq(-4, 4, 0.01); cumulative<-pnorm(xseq, 0, 1) 
# plot (cumulative, main="CDF")
plot(xseq, cumulative, col="darkred", xlab="", ylab="Cumulative Probability", type="l",lwd=2, cex=2, main="CDF of (Simulated) Standard Normal", cex.axis=.8)

The example below shows a semi-interactive standard normal distribution calculator using plot_ly. You can see many other probability distribution calculators on the Distributome site.

library(plotly)
library(quantmod)

z<-seq(-4, 4, 0.1)  # points from -4 to 4 in 0.1 steps
q<-seq(0.001, 0.999, 0.01)  # probability quantile values from 0.1% to 99.9% in 1% steps

dStandardNormal <- data.frame(Z=z, Density=dnorm(z, mean=0, sd=1), Distribution=pnorm(z, mean=0, sd=1)) 

# plot(z, dStandardNormal$Density, col="darkblue",xlab="z", ylab="Density", type="l",lwd=2, cex=2, main="Standard Normal PDF", cex.axis=0.8)
# polygon(z, dStandardNormal$Density, col="red", border="blue")

dStandardNormal$ID <- seq.int(nrow(dStandardNormal))

aggregate_by <- function(dataset, feature) {
  feature <- lazyeval::f_eval(feature, dataset)
  levels <- plotly:::getLevels(feature)
  aggData <- lapply(seq_along(levels), function(x) {
    cbind(dataset[feature %in% levels[seq(1, x)], ], frame = levels[[x]])
  })
  dplyr::bind_rows(aggData)
}

dStandardNormal <- dStandardNormal %>% aggregate_by(~ID)

plotMe <- dStandardNormal %>%
  plot_ly(
    x = ~Z, 
    y = ~Density, 
    frame = ~frame,
    type = 'scatter', 
    mode = 'lines', 
    fill = 'tozeroy', 
    fillcolor="red",
    line = list(color = "blue"),
    text = ~paste("Z: ", Z, "<br>Density: ", Density,
                  "<br>CDF: ", Distribution), 
    hoverinfo = 'text'
  ) %>%
  layout(
    title = "Standard Normal Distribution",
    yaxis = list(
      title = "N(0,1) Density", 
      range = c(0,0.45), 
      zeroline = F,
      tickprefix = "" # density value
    ),
    xaxis = list(
      title = "Z", 
      range = c(-4,4), 
      zeroline = T, 
      showgrid = T
    )
  ) %>% 
  animation_opts(
    frame = 100, 
    transition = 1, 
    redraw = FALSE
  ) %>%
  animation_slider(
    currentvalue = list(
      prefix = "Z: "
    )
  )
plotMe 
# If you want to create a shareable plotly link to the Interactive Normal Distribution Calculator, see the API for establishing credentials: https://plot.ly/r/getting-started
# stdNormalCalculator = api_create(p, filename="SOCR/interactiveStdNormalCalculator")
# stdNormalCalculator

Suppose we are interested in computing, or estimating, the inverse-CDF from first principles. Specifically, to invert the CDF, we need to be able to solve the following equation (representing our objective function): \[CDF(x)-p=0.\]

The stats::uniroot and stats::nlm R functions can be used to solve this equation; uniroot() finds a root of a univariate function within a given interval, whereas nlm() performs non-linear minimization of a function \(f\) using a Newton-type algorithm. Let’s test this optimization using \(N(\mu=100,\sigma=20)\).

set.seed(1234)
x <- rnorm(1000, 100, 20)
pdf_x <- density(x)

# Interpolate the density; values returned for inputs outside [min(x), max(x)] are set to zero
f_x <- approxfun(pdf_x$x, pdf_x$y, yleft=0, yright=0)

# Manual computation of the cdf by numeric integration
cdf_x <- function(x){
  v <- integrate(f_x, -Inf, x)$value
  if (v<0) v <- 0
  else if(v>1) v <- 1
  return(v)
}

# Finding the roots of the inverse-CDF function by hand (CDF(x)-p=0)
invcdf <- function(p){
  uniroot(function(x){cdf_x(x) - p}, range(x))$root
  # alternatively, can use
  # nlm(function(x){cdf_x(x) - p}, 0)$estimate  
  # minimum - the value of the estimated minimum of f.
  # estimate - the point at which the minimum value of f is obtained.
}
invcdf(0.5)
## [1] 99.16995
# We can validate that the inverse-CDF is correctly computed: F^{-1}(F(x))==x
cdf_x(invcdf(0.8))
## [1] 0.8

The ability to compute exactly, or at least estimate, the inverse-CDF function is important for many reasons. For instance, generating random observations from a specified probability distribution (e.g., normal, exponential, or gamma) is an important task in many scientific studies. One approach to such random number generation evaluates the inverse CDF at random uniform values \(u\sim U(0,1)\). Recall that in Chapter 15 we showed an example of generating random uniform samples using atmospheric noise. The key step is the ability to quickly, efficiently and reliably estimate the inverse CDF, as illustrated above.

Let’s see why inverting the CDF using random uniform data works. Consider the cumulative distribution function (CDF) of a probability distribution from which we are interested in sampling. If the CDF has a closed form analytical expression and is invertible, then we generate a random sample from that distribution by evaluating the inverse CDF at \(u\), where \(u \sim U(0,1)\). This is possible since a continuous CDF, \(F\), is a one-to-one mapping of the domain of the CDF (range of \(X\)) into the interval \([0,1]\). Therefore, if \(U\) is a uniform random variable on \([0,1]\), then \(X = F^{-1}(U)\) has the distribution \(F\). Suppose \(U \sim Uniform[0,1]\), then \(P(F^{-1}(U) \leq x)= P(U \leq F(x))\), by applying \(F\) to both sides of this inequality, since \(F\) is monotonic. Thus, \(P(F^{-1}(U) \leq x)= F(x)\), since \(P(U \leq u) = u\) for uniform random variables.
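This inverse-transform sampling argument can be demonstrated with a distribution whose quantile function has a closed form, e.g., the exponential, where \(F(x)=1-e^{-\lambda x}\) and \(F^{-1}(u)=-\ln(1-u)/\lambda\) (our own choice of illustration):

```r
# Inverse-transform sampling: X = F^{-1}(U), where U ~ Uniform(0,1)
set.seed(1234)
lambda <- 2
u <- runif(10000)
x <- -log(1 - u)/lambda   # closed-form inverse CDF of Exp(lambda)

# compare against the theoretical exponential distribution
ks.test(x, "pexp", rate = lambda)$p.value  # p-value should be large (no evidence of mismatch)
```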

2.2 Example 2: minimizing a bivariate function

Let’s look at the function \(f(x_1, x_2) = (x_1-3)^2 + (x_2+4)^2\). We define the function in R and utilize the optim function to obtain the extrema points in the support of the objective function and/or the extrema values at these critical points.

require("stats")
f <- function(x) { (x[1] - 3)^2 + (x[2] +4)^2 }
initial_x <- c(0, -1)
x_optimal <- optim(initial_x, f, method="CG") # performs minimization
x_min <- x_optimal$par
# x_min contains the domain values where the (local) minimum is attained
x_min   # critical point/vector
## [1]  3 -4
x_optimal$value  # extrema value of the objective function
## [1] 8.450445e-15

optim allows the use of 6 candidate optimization strategies:

  • Nelder-Mead: robust but relatively slow, works reasonably well for non-differentiable functions.
  • BFGS: quasi-Newton method (also known as a variable metric algorithm), uses function values and gradients to build up a picture of the surface to be optimized.
  • CG: conjugate gradients method, fragile, but successful in larger optimization problems because it does not need to store a large matrix.
  • L-BFGS-B: allows box constraints.
  • SANN: a variant of simulated annealing, belonging to the class of stochastic global optimization methods.
  • Brent: for one-dimensional problems only, useful in cases where optim() is used inside other functions where only method can be specified.
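These strategies can be compared head-to-head. The sketch below (our own illustration) reruns the Example 2 minimization under several of the optim() methods:

```r
# Compare several optim() strategies on the Example 2 objective
f <- function(x) (x[1] - 3)^2 + (x[2] + 4)^2
for (m in c("Nelder-Mead", "BFGS", "CG", "L-BFGS-B")) {
  res <- optim(c(0, -1), f, method = m)
  cat(sprintf("%-12s par = (%8.5f, %8.5f)  value = %.2e\n",
              m, res$par[1], res$par[2], res$value))
}
```

All four methods should land near the true minimizer \((3, -4)\), although their residual objective values differ slightly.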

2.3 Example 3: using simulated annealing to find the maximum of an oscillatory function

Consider the function \(f(x) = 10 \sin(0.3 x)\times \sin(1.3 x^2) - 0.00002 x^4 + 0.3 x+35\). Maximizing \(f()\) is equivalent to minimizing \(-f()\). Let’s plot this oscillatory function, find and report its critical points and extremum values.

The function optim returns two important results:

  • par: the best set of domain parameters found to optimize the function
  • value: the extreme values of the function corresponding to par.
funct_osc <- function (x) { 
  -(10*sin(0.3*x)*sin(1.3*x^2) - 0.00002*x^4 + 0.3*x+35) 
}
res <- optim(16, funct_osc, method = "SANN", control = list(maxit = 20000, temp = 20, parscale = 20))
res$par
## [1] 15.66197
res$value
## [1] -48.49313
plot(funct_osc, -50, 50, n = 1000, main = "optim() minimizing an oscillatory function")
abline(v=res$par, lty=3, lwd=4, col="red")

3 Constrained Optimization

3.1 Equality constraints

When there are support restrictions, dependencies or other associations between the domain variables \(x_1, x_2, \ldots, x_n\), constrained optimization needs to be applied.

For example, we can have \(k\) equations specifying these restrictions, which may specify certain model characteristics: \[\begin{cases} g_1(x_1, x_2, \ldots, x_n) = 0\\ \ldots \\ g_k(x_1, x_2, \ldots, x_n) = 0 \end{cases} .\]

Note that the right hand sides of these equations may always be assumed to be trivial (\(0\)), otherwise we can just move the non-trivial parts within the constraint functions \(g_i\). Linear Programming, Quadratic Programming, and Lagrange multipliers may be used to solve such equality-constrained optimization problems.

3.2 Lagrange Multipliers

We can merge the equality constraints into the objective function (\(f \longrightarrow f^*\)). Lagrange multipliers represent a typical solution strategy that turns the constrained optimization problem (\(\min_{x} {f(x)}\) subject to \(g_i(x_1, x_2, \ldots, x_n)=0\), \(1\leq i\leq k\)) into an unconstrained optimization problem: \[f^*(x_1, x_2, \ldots, x_n; \lambda_1, \lambda_2, \ldots, \lambda_k) = f(x_1, x_2, \ldots, x_n) + \sum_{i=1}^k {\lambda_i g_i(x_1, x_2, \ldots, x_n)}.\]

Then, we can apply traditional unconstrained optimization schemas, e.g., extreme value theorem, to minimize the unconstrained problem:

\[f^*(x_1, x_2, \ldots, x_n; \lambda_1, \lambda_2, \ldots, \lambda_k) = f(x_1, x_2, \ldots, x_n) + \lambda_1 g_1(x_1, x_2, \ldots, x_n) + \ldots + \lambda_k g_k(x_1, x_2, \ldots, x_n).\]

This represents the unconstrained formulation of the optimization problem using Lagrange multipliers.

The solution of the constrained problem is also a solution to: \[\nabla f^* = \left [\frac{\partial f^*}{\partial x_1}, \frac{\partial f^*}{\partial x_2}, \ldots, \frac{\partial f^*}{\partial x_n}; \frac{\partial f^*}{\partial \lambda_1}, \frac{\partial f^*}{\partial \lambda_2}, \ldots, \frac{\partial f^*}{\partial \lambda_k} \right ] = [0, 0, \ldots, 0].\]
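For a quadratic objective with a linear equality constraint, this stationarity system is linear and can be solved directly. As a toy example of our own, minimize \(f(x,y)=x^2+y^2\) subject to \(g(x,y)=x+y-1=0\); setting the gradient of the Lagrangian to zero gives \(2x+\lambda=0\), \(2y+\lambda=0\), \(x+y=1\):

```r
# Lagrange multipliers by hand for min x^2 + y^2 subject to x + y = 1
# Stationarity system (unknowns x, y, lambda):
#   2x     + lambda = 0
#       2y + lambda = 0
#   x + y           = 1
A <- rbind(c(2, 0, 1),
           c(0, 2, 1),
           c(1, 1, 0))
b <- c(0, 0, 1)
solve(A, b)   # (x, y, lambda) = (0.5, 0.5, -1)
```

The constrained minimum is attained at \(x=y=1/2\) with multiplier \(\lambda=-1\).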

3.3 Inequality constrained optimization

There are no general solutions for arbitrary inequality constraints; however, partial solutions do exist when some restrictions on the form of constraints are present.

When both the constraints and the objective function are linear functions of the domain variables, then the problem can be solved by Linear Programming.

3.3.1 Linear Programming (LP)

LP applies when the objective function is linear and the constraint functions are also linear combinations of the same variables.

Consider the following elementary (minimization) example: \[ \min_{x_1, x_2, x_3} (-3x_1 -4x_2 -3x_3)\]

subject to: \[ \left\{ \begin{array}{rl} 6x_1 + 2x_2 + 4x_3 & \leq 150 \\ x_1 + x_2 + 6x_3 & \geq 0 \\ 4x_1 + 5x_2 + 4x_3 & = 40 \end{array} \right. \]

The exact solution is \(x_1 = 0, x_2 = 8, x_3 = 0\), and can be computed using the package lpSolveAPI to set up the constraint problem and the generic solve() method to find its solutions.

# install.packages("lpSolveAPI")
library(lpSolveAPI)

lps.model <- make.lp(0, 3) # define 3 variables
# add the constraints as a matrix of the linear coefficients, relations and RHS
add.constraint(lps.model, c(6, 2, 4), "<=", 150)
add.constraint(lps.model, c(1, 1, 6), ">=",   0)
add.constraint(lps.model, c(4, 5, 4), "=" ,  40)
# set objective function (default: find minimum)
set.objfn(lps.model, c(-3, -4, -3))  

# you can save the model to a file
# write.lp(lps.model, 'c:/Users/LPmodel.lp', type='lp')

# these commands define the constraint linear model 
# /* Objective function */
#   min: -3 x1 -4 x2 -3 x3;
# 
# /* Constraints */
# +6 x1 +2 x2 +4 x3 <= 150;
# +  x1 +  x2 +6 x3 >=   0;
# +4 x1 +5 x2 +4 x3  =  40;
#
# writing it in the text file named 'LPmodel.lp'

solve(lps.model)
## [1] 0
# Retrieve the values of the variables from a solved linear program model 
get.variables(lps.model)  # check against the exact solution x_1 = 0, x_2 = 8, x_3 = 0
## [1] 0 8 0
get.objective(lps.model) # get optimal (min) value
## [1] -32

In lower-dimensional problems, we can also plot the constraints to graphically demonstrate the corresponding support restriction. For instance, here is an example of a simpler 2D constraint set and a graphical representation of its feasible region.

\[ \left\{ \begin{array}{rl} x_1 & \leq \frac{150 -2x_2}{6} \\ x_1 & \geq -x_2\\ \end{array} \right. . \]

library(ggplot2)
ggplot(data.frame(x = c(-100, 0)), aes(x = x)) + 
  stat_function(fun=function(x) {(150-2*x)/6}, aes(color="Function 1")) + 
  stat_function(fun=function(x) { -x }, aes(color = "Function 2")) + 
  theme_bw() + 
  scale_color_discrete(name = "Function") + 
  geom_polygon(
    data = data.frame(x = c(-100, -100, 0, 0), y = c(0, 350/6, 150/6, 0)),
    aes(x = x, y = y, fill = "Constraint 1"),
    inherit.aes = FALSE, alpha = 0.5) + 
  geom_polygon(
    data = data.frame(x = c(-100, -100, 0), y = c(0, 100, 0)),
    aes(x = x, y = y, fill = "Constraint 2"),
    inherit.aes = FALSE, alpha = 0.3) + 
  scale_fill_discrete(name = "Constraint Set") + 
  scale_y_continuous(limits = c(0, 100))

Here is another example of maximization of a trivariate cost function, \(f(x_1, x_2, x_3)=3x_1+ 4x_2 -x_3\), subject to: \[ \left\{ \begin{array}{rl} -x_1 + 2x_2 + 3x_3 & \leq 16 \\ 3x_1 - x_2 - 6x_3 & \geq 0 \\ x_1 - x_2 & \leq 2 \end{array} \right. . \]

lps.model2 <- make.lp(0, 3)
add.constraint(lps.model2, c(-1, 2, 3), "<=", 16)
add.constraint(lps.model2, c(3, -1, -6), ">=",  0)
add.constraint(lps.model2, c(1, -1, 0), "<=",  2)
set.objfn(lps.model2, c(3, 4, -1), indices = c(1, 2, 3)) 
lp.control(lps.model2, sense='max')     # changes to max: 3 x1 + 4 x2 - x3
## $anti.degen
## [1] "fixedvars" "stalling" 
## 
## $basis.crash
## [1] "none"
## 
## $bb.depthlimit
## [1] -50
## 
## $bb.floorfirst
## [1] "automatic"
## 
## $bb.rule
## [1] "pseudononint" "greedy"       "dynamic"      "rcostfixing" 
## 
## $break.at.first
## [1] FALSE
## 
## $break.at.value
## [1] 1e+30
## 
## $epsilon
##       epsb       epsd      epsel     epsint epsperturb   epspivot 
##      1e-10      1e-09      1e-12      1e-07      1e-05      2e-07 
## 
## $improve
## [1] "dualfeas" "thetagap"
## 
## $infinite
## [1] 1e+30
## 
## $maxpivot
## [1] 250
## 
## $mip.gap
## absolute relative 
##    1e-11    1e-11 
## 
## $negrange
## [1] -1e+06
## 
## $obj.in.basis
## [1] TRUE
## 
## $pivoting
## [1] "devex"    "adaptive"
## 
## $presolve
## [1] "none"
## 
## $scalelimit
## [1] 5
## 
## $scaling
## [1] "geometric"   "equilibrate" "integers"   
## 
## $sense
## [1] "maximize"
## 
## $simplextype
## [1] "dual"   "primal"
## 
## $timeout
## [1] 0
## 
## $verbose
## [1] "neutral"
solve(lps.model2)         # status code 0 indicates the solver converged successfully
## [1] 0
get.variables(lps.model2) # get point of maximum
## [1] 20 18  0
get.objective(lps.model2) # get optimal (max) value
## [1] 132

In 3D we can utilize the rgl::surface3d() method to display the constraints. This output is suppressed, as it can only be interpreted via the pop-out 3D rendering window.

library("rgl")
# install.packages("rglwidget")
library("rglwidget")
n <- 100
x <- y <- seq(-500, 500, length = n)
region <- expand.grid(x = x, y = y)

z1 <- matrix(((150 -2*region$x -4*region$y)/6), n, n)
z2 <- matrix(-region$x + 6*region$y, n, n)
z3 <- matrix(40 -5*region$x - 4*region$y, n, n)

# open3d()
surface3d(x, y, z1, back = 'line', front = 'line', col = 'red', lwd = 1.5, alpha = 0.4)
surface3d(x, y, z2, back = 'line', front = 'line', col = 'orange', lwd = 1.5, alpha = 0.4)
surface3d(x, y, z3, back = 'line', front = 'line', col = 'blue', lwd = 1.5, alpha = 0.4)
axes3d()
rglwidget() 
# the use of rglwidget() allows the embedding of an interactive 3D plot into the HTML report

We can also use plot_ly to display the linear constraints (planes) in 3D.

#define the regular x,y grid (in 2D)
n <- 100
x <- y <- seq(-500, 500, length = n)
region <- expand.grid(x = x, y = y)

# define the z values for all three 2D planes
z1 <- matrix(((150 -2*region$x -4*region$y)/6), n, n)
z2 <- matrix(-region$x + 6*region$y, n, n)
z3 <- matrix(40 -5*region$x - 4*region$y, n, n)


library(plotly)
myPlotly <- plot_ly(x=~x, y=~y, z=~z1,
      colors = c("blue", "red"),type="surface", opacity=0.7) %>%
  add_trace(x=~x, y=~y, z=~z2, type="surface",
            colors = "green", opacity=0.5) %>%
  add_trace(x=~x, y=~y, z=~z3, type="surface", 
            colors = "orange", opacity=0.3) %>%
  layout(scene = list(
    aspectmode = "manual", aspectratio = list(x=1, y=1, z=1),
    xaxis = list(title = "X"),
    yaxis = list(title = "Y"),
    zaxis = list(title = "ZA"))
    )
# print(myPlotly)
myPlotly

It is possible to restrict the domain type to contain only solutions that are:

  • integers, which makes it an Integer Linear Programming (ILP),
  • binary/Boolean values (BLP), or
  • mixed types, Mixed Integer Linear Programming (MILP).

Some examples are included below.

3.3.2 Mixed Integer Linear Programming (MILP)

Let’s demonstrate MILP with an example where the type of \(x_1\) is unrestricted, \(x_2\) is dichotomous (binary), and \(x_3\) is restricted to be an integer.

lps.model <- make.lp(0, 3)
add.constraint(lps.model, c(6, 2, 4), "<=", 150)
add.constraint(lps.model, c(1, 1, 6), ">=", 0)
add.constraint(lps.model, c(4, 5, 4), "=", 40)
set.objfn(lps.model, c(-3, -4, -3))

set.type(lps.model, 2, "binary")
set.type(lps.model, 3, "integer")
get.type(lps.model) # This is Mixed Integer Linear Programming (MILP)
## [1] "real"    "integer" "integer"
set.bounds(lps.model, lower=-5, upper=5, columns=c(1))

# give names to columns and restrictions
dimnames(lps.model) <- list(c("R1", "R2", "R3"), c("x1", "x2", "x3")) 

print(lps.model)
## Model name: 
##             x1    x2    x3         
## Minimize    -3    -4    -3         
## R1           6     2     4  <=  150
## R2           1     1     6  >=    0
## R3           4     5     4   =   40
## Kind       Std   Std   Std         
## Type      Real   Int   Int         
## Upper        5     1   Inf         
## Lower       -5     0     0
solve(lps.model)
## [1] 0
get.objective(lps.model)
## [1] -30.25
get.variables(lps.model)
## [1] 4.75 1.00 4.00
get.constraints(lps.model)
## [1] 46.50 29.75 40.00

The next example limits all three variables to be dichotomous (binary).

lps.model <- make.lp(0, 3)
add.constraint(lps.model, c(1, 2, 4), "<=", 5)
add.constraint(lps.model, c(1, 1, 6), ">=", 2)
add.constraint(lps.model, c(1, 1, 1), "=",  2)
set.objfn(lps.model, c(2, 1, 2))

set.type(lps.model, 1, "binary")
set.type(lps.model, 2, "binary")
set.type(lps.model, 3, "binary")

print(lps.model)
## Model name: 
##            C1   C2   C3       
## Minimize    2    1    2       
## R1          1    2    4  <=  5
## R2          1    1    6  >=  2
## R3          1    1    1   =  2
## Kind      Std  Std  Std       
## Type      Int  Int  Int       
## Upper       1    1    1       
## Lower       0    0    0
solve(lps.model)
## [1] 0
get.variables(lps.model)
## [1] 1 1 0
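Binary LP is the natural formulation for selection problems such as the classical 0/1 knapsack. The sketch below uses hypothetical item values and weights of our own choosing:

```r
# Toy 0/1 knapsack as a binary LP:
# maximize 10 x1 + 7 x2 + 12 x3 + 8 x4
# subject to 4 x1 + 3 x2 + 5 x3 + 2 x4 <= 8 (weight capacity), x binary
library(lpSolveAPI)
knap <- make.lp(0, 4)
add.constraint(knap, c(4, 3, 5, 2), "<=", 8)  # weight capacity constraint
set.objfn(knap, c(10, 7, 12, 8))
invisible(lp.control(knap, sense = "max"))    # maximize (suppress control echo)
for (j in 1:4) set.type(knap, j, "binary")
solve(knap)
get.variables(knap)   # selected items (indicator vector)
get.objective(knap)   # best achievable total value
```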

3.4 Quadratic Programming (QP)

QP can be used for second-order (quadratic) objective functions whose constraint functions are still linear combinations of the domain variables. More details are available in the quadprog documentation.

A matrix formulation of the problem can be expressed as minimizing an objective function: \[f(X) = \frac{1}{2} X^T D X - d^T X, \]

where \(X\) is a vector \([x_1, x_2, \ldots, x_n]^T\), \(D\) is the matrix of weights of each association pair, \(x_i, x_j\), and \(d\) are the weights for each individual feature, \(x_i\). The \(\frac{1}{2}\) coefficient ensures that the weights matrix \(D\) is symmetric and each \(x_i, x_j\) pair is not double-counted. This cost function is subject to the constraints: \[A X\ \ [ = | \geq ]\ \ b, \] where the first \(k\) constraints may represent equalities (\(=\)) and the remaining ones are inequalities (\(\geq\)), and \(b\) is the constraints' right-hand side (RHS) constant vector.

Here is an example of a QP objective function and its R optimization: \[f(x_1, x_2, x_3) = x_1^2 - x_1x_2 + x_2^2 + x_2 x_3 + x_3^2 -5 x_2 + 3 x_3.\]

We can rewrite \(f\) in a slightly modified form to explicitly specify the parameters (\(D\) and \(d\)): \[f(x_1, x_2, x_3) = \frac{1}{2}\underbrace{(2 x_1^2 - 2 x_1x_2 + 2 x_2^2 + 2 x_2 x_3 + 2 x_3^2)}_{X^TDX} - \underbrace{(5 x_2 - 3 x_3)}_{d^TX}.\]

The symmetric positive definite matrix \(D\) is: \[D=\left (\begin{array}{rcl} 2 & -1 & 0 \\ -1 & 2 & 1 \\ 0 & 1 & 2 \end{array} \right ).\]

Subject to the following constraints (\(A_{eq}\ X=b_{eq}\) and \(AX\geq b\)):

\[\begin{array}{rcl} -4x_1 -3x_2 &=& -8 \\ 2x_1 + x_2 &=& 2 \\ 2x_2 - x_3 &\geq& 0 \end{array} .\]

library(quadprog)

Dmat       <- matrix(c( 2, -1, 0, 
                       -1, 2, 1, 
                        0, 1, 2), 3, 3)
dvec       <- c(0, 5, -3)
Amat       <- matrix(c(-4, -3, 0, 
                        2, 1, 0, 
                        0, 2, -1), 3, 3)
bvec       <- c(-8, 2, 0)
n.eqs      <- 2 # the first two constraints are equalities
sol <- solve.QP(Dmat, dvec, Amat, bvec=bvec, meq=2)
sol$solution # get the (x1, x2, x3) point of minimum
## [1] -1.0  4.0 -3.5
sol$value  # get the actual cost function minimum
## [1] -11.25

The minimum value, \(-11.25\), of the QP solution is attained at \(x_1=-1, x_2=4, x_3=-3.5\). Let’s double check the solution:

ef <- function(X, D, d, A) {
  1/2 * t(X) %*% D %*% X - t(d) %*% X
}
ef_man <- function(x) {
   (1/2)*(2*x[1]^2  - 2*x[1]*x[2] + 2*x[2]^2 +2* x[2]*x[3] + 2*x[3]^2) - (5*x[2] - 3*x[3]) 
}
X1 <- c(-1.0,  4.0, -3.5); X2 <- c(-2.5, 6, 0)
ef(X1, Dmat, dvec, Amat); ef(X2, Dmat, dvec, Amat)
##        [,1]
## [1,] -11.25
##       [,1]
## [1,] 27.25
ef_man(X1); ef_man(X2)
## [1] -11.25
## [1] 27.25

We can also use Wolfram Alpha to validate the solution.

When \(D\) is a positive definite matrix, i.e., \(X^T D X \gt 0\) for all non-zero \(X\), the QP problem may be solved in polynomial time. Otherwise, the QP problem is NP-hard. In general, even if \(D\) has only one negative eigenvalue, the QP problem is still NP-hard.

The QP function solve.QP() expects a positive definite matrix \(D\).
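One simple sanity check (a base R eigendecomposition sketch, using the Dmat defined in the example above) verifies positive definiteness before invoking solve.QP():

```r
# Check positive definiteness via the eigenvalue spectrum
is_pos_def <- function(D, tol = 1e-8) {
  all(eigen(D, symmetric = TRUE, only.values = TRUE)$values > tol)
}
is_pos_def(Dmat)                         # TRUE: eigenvalues 2 - sqrt(2), 2, 2 + sqrt(2)
is_pos_def(matrix(c(1, 2, 2, 1), 2, 2))  # FALSE: eigenvalues 3 and -1
```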

4 General Non-linear Optimization

The package Rsolnp provides a special function solnp(), which solves the general non-linear programming problem:

\[\min_x { f(x) }\] subject to: \[g(x)=0\] \[l_h \leq h(x) \leq u_h\] \[l_x \leq x \leq u_x, \] where \(f(x), g(x), h(x)\) are all smooth functions.

4.1 Dual problem optimization

Duality in mathematics really just means having two complementary ways to think about an optimization problem. The primal problem represents an optimization challenge in terms of the original decision variable \(x\). The dual problem, also called the Lagrange dual, searches for a lower bound of a minimization problem (or an upper bound for a maximization problem). In general, the primal problem may be difficult to analyze, or solve directly, because it may include non-differentiable penalty terms, e.g., \(l_1\) norms; recall LASSO/Ridge regularization in Chapter 17. Hence, we turn to the corresponding Lagrange dual problem, where the solutions may be more amenable, especially for convex objective functions, i.e., those satisfying \[f(\lambda x +(1-\lambda)y)\leq \lambda f(x) + (1-\lambda)f(y).\]

4.1.1 Motivation

Suppose we want to borrow money, \(x\), from a bank or lender, and \(f(x)\) represents the borrowing cost to us. There are natural "design constraints" on money lending. For instance, there may be a cap on the interest rate, \(h(x)\leq b\), or various constraints on the loan duration. There may be multiple lenders, including self-funding, that may "charge" us \(f(x)\) for lending us \(x\). The lenders' goal is to maximize profit, yet they can't charge us more than the prime interest rate plus some premium based on our creditworthiness. Thus, for a given fixed \(\lambda\), a lender may make an offer to lend us \(x\) aiming to minimize \[f(x)+\lambda \times h(x).\]

If this cost is not optimized, i.e., minimized, you may be able to get another loan \(y\) at lower cost \(f(y)<f(x)\), and the funding agency loses your business. If the cost/objective function is minimized, the lender may maximize their profit by varying \(\lambda\) and still get us to sign on the loan.

The customer’s strategy represents a game-theoretic interpretation of the primal problem, whereas the dual problem corresponds to the strategy of the lender.

In solving complex optimization problems, duality is equivalent to the existence of a saddle point of the Lagrangian. For convex problems, the double-dual is equivalent to the primal problem. In other words, applying the convex conjugate (Fenchel transform) twice returns the convexification of the original objective function, which coincides with the original function in most practical situations.

The dual of a vector space is defined as the space of all continuous linear functionals on that space. Let \(X=R^n\), \(Y=R^m\), \(f:X\longrightarrow R\), and \(h:X\longrightarrow Y\). Consider the following optimization problem:

\[\begin{array}{lcl} \min_x {f(x)} \\ \text{subject to} \\ x \in X \\ h(x)\leq 0. \end{array} \]

Then, this primal problem has a corresponding dual problem:

\[\begin{array}{lcl} \max_{\lambda} { \inf_{x \in X} {\left ( f(x) + \langle \lambda, h(x) \rangle \right )}} \\ \text{subject to} \\ \lambda_i \geq 0, \forall 1\leq i\leq m. \end{array}\]

The parameter \(\lambda\in R^m\) is an element of the dual space of \(Y\), i.e., \(Y^*\), since the inner product \(\langle \lambda, h(x) \rangle\) is a continuous linear functional on \(Y\). Here \(Y\) is finite dimensional and by the Riesz representation theorem \(Y^*\) is isomorphic to \(Y\). Note that in general, for infinite dimensional spaces, \(Y\) and \(Y^*\) are not guaranteed to be isomorphic.
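To make the primal-dual relation concrete, here is a minimal base-R sketch (an illustration, not one of the chapter’s worked examples) for the toy convex problem \(\min_x x^2\) subject to \(h(x)=1-x\leq 0\), whose primal optimum is \(f(1)=1\). The dual function \(g(\lambda)=\inf_x\{f(x)+\lambda h(x)\}\) is then maximized over \(\lambda\geq 0\):

```r
# Toy primal: min x^2  subject to  h(x) = 1 - x <= 0  (primal optimum f(1) = 1)
f <- function(x) x^2
h <- function(x) 1 - x

# Dual function: g(lambda) = inf_x { f(x) + lambda*h(x) }
g <- function(lambda) {
  optimize(function(x) f(x) + lambda*h(x), interval = c(-10, 10))$objective
}

# Maximize the dual over lambda >= 0; strong duality holds for this convex problem
dual <- optimize(g, interval = c(0, 10), maximum = TRUE)
c(lambda_star = dual$maximum, dual_value = dual$objective)  # ~ (2, 1)
```

Here \(g(\lambda)=\lambda-\lambda^2/4\) attains its maximum \(1\) at \(\lambda^*=2\), matching the primal optimum (strong duality).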

4.1.2 Example 1: Linear example

Minimize \(f(x, y)=5x-3y\), constrained by \(x^2+y^2=136\), which has a minimum value of \(-68\) attained at \((-10, 6)\). We will use the Rsolnp::solnp() method in this example.

# install.packages("Rsolnp")
library(Rsolnp)

fn1 <- function(x) { # f(x, y) = 5x-3y
  5*x[1] - 3*x[2]
}

# constraint z1: x^2+y^2=136
eqn1 <- function(x) { 
  z1=x[1]^2 + x[2]^2
  return(c(z1))
}
constraints = c(136)

x0 <- c(1, 1) # setup initial values
sol1 <- solnp(x0, fun = fn1, eqfun = eqn1, eqB = constraints)
## 
## Iter: 1 fn: 37.4378   Pars:  30.55472 38.44528
## Iter: 2 fn: -147.9181     Pars:  -6.57051 38.35517
## Iter: 3 fn: -154.7345     Pars:  -20.10545  18.06907
## Iter: 4 fn: -96.4033  Pars:  -14.71366   7.61165
## Iter: 5 fn: -72.4915  Pars:  -10.49919   6.66517
## Iter: 6 fn: -68.1680  Pars:  -10.04485   5.98124
## Iter: 7 fn: -68.0006  Pars:   -9.99999   6.00022
## Iter: 8 fn: -68.0000  Pars:  -10.00000   6.00000
## Iter: 9 fn: -68.0000  Pars:  -10.00000   6.00000
## solnp--> Completed in 9 iterations
sol1$values[10]  # sol1$values contains all steps of the iteration algorithm and the last value is the min value
## [1] -68
sol1$pars
## [1] -10   6
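We can confirm this numerical solution analytically. At the optimum, the Lagrange condition \(\nabla f = \lambda \nabla g\) must hold, i.e., \((5,-3) = \lambda(2x, 2y)\), together with the constraint \(x^2+y^2=136\). A quick base-R check of the reported solution:

```r
# Check the Lagrange condition grad f = lambda * grad g at the solnp solution
xy <- c(-10, 6)
xy[1]^2 + xy[2]^2           # 136: the constraint holds exactly
lambda <- 5 / (2*xy[1])     # from the first component: 5 = 2*lambda*x
c(lambda, 2*lambda*xy[2])   # lambda = -0.25; 2*lambda*y = -3 matches the second component of grad f
```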

4.1.3 Example 2: Quadratic example

Minimize \(f(x, y) = 4x^2 + 10y^2 + 5\) subject to the inequality constraint \(0\leq x^2+y^2 \leq 4\), which has a minimum value of \(5\) attained at the origin \((0, 0)\).

fn2 <- function(x) {  # f(x, y) = 4x^2 + 10y^2 +5
  4*x[1]^2 + 10*x[2]^2 +5
}

# constraint z1: x^2+y^2 <= 4
ineq2 <- function(x) { 
  z1=x[1]^2 + x[2]^2
  return(c(z1))
}

lh <- c(0)
uh <- c(4)

x0 = c(1, 1) # setup initial values
sol2 <- solnp(x0, fun = fn2, ineqfun = ineq2, ineqLB = lh, ineqUB=uh)
## 
## Iter: 1 fn: 7.8697    Pars:  0.68437 0.31563
## Iter: 2 fn: 5.6456    Pars:  0.39701 0.03895
## Iter: 3 fn: 5.1604    Pars:  0.200217 0.002001
## Iter: 4 fn: 5.0401    Pars:  0.10011821 0.00005323
## Iter: 5 fn: 5.0100    Pars:  0.0500592618 0.0000006781
## Iter: 6 fn: 5.0025    Pars:   0.02502983706 -0.00000004425
## Iter: 7 fn: 5.0006    Pars:   0.01251500215 -0.00000005034
## Iter: 8 fn: 5.0002    Pars:   0.00625757145 -0.00000005045
## Iter: 9 fn: 5.0000    Pars:   0.00312915970 -0.00000004968
## Iter: 10 fn: 5.0000   Pars:   0.00156561388 -0.00000004983
## Iter: 11 fn: 5.0000   Pars:   0.0007831473 -0.0000000508
## Iter: 12 fn: 5.0000   Pars:   0.00039896484 -0.00000005045
## Iter: 13 fn: 5.0000   Pars:   0.00021282342 -0.00000004897
## Iter: 14 fn: 5.0000   Pars:   0.00014285437 -0.00000004926
## Iter: 15 fn: 5.0000   Pars:   0.00011892066 -0.00000004976
## solnp--> Completed in 15 iterations
sol2$values
##  [1] 19.000000  7.869675  5.645626  5.160388  5.040095  5.010024  5.002506
##  [8]  5.000627  5.000157  5.000039  5.000010  5.000002  5.000001  5.000000
## [15]  5.000000  5.000000
sol2$pars
## [1]  1.189207e-04 -4.976052e-08

There are a number of parameters that control the solnp procedure. For instance, TOL defines the tolerance for optimality (which impacts the convergence) and trace=0 turns off the printing of the results at each iteration.

ctrl <- list(TOL=1e-15, trace=0)
sol2 <- solnp(x0, fun = fn2, ineqfun = ineq2, ineqLB = lh, ineqUB=uh, control=ctrl)
sol2$pars
## [1]  1.402813e-08 -5.015532e-08

4.1.4 Example 3: More complex non-linear optimization

Let’s try to minimize \[f(X) = -x_1 x_2 x_3\] subject to

\[\begin{array}{rcl} 4x_1 x_2 + 2x_2 x_3 + 2x_3x_1 = 100\\ 1 \leq x_i \leq 10, i = 1, 2, 3 \end{array}\]

fn3 <- function(x, ...){
  -x[1]*x[2]*x[3]
}

eqn3 <- function(x, ...){
    4*x[1]*x[2]+2*x[2]*x[3]+2*x[3]*x[1]
}
constraints3 = c(100)

lx <- rep(1, 3)
ux <- rep(10, 3)

pars <- c(2, 1, 7) # setup: Try alternative starting-parameter vector (pars)
ctrl <- list(TOL=1e-6, trace=0)
sol3 <- solnp(pars, fun=fn3, eqfun=eqn3, eqB = constraints3, LB=lx, UB=ux, control=ctrl)
sol3$pars
## [1] 2.886751 2.886751 5.773505

The non-linear optimization is sensitive to the initial parameters (pars), especially when the objective function is not smooth or if there are many local minima. The function gosolnp() may be employed to generate initial (guesstimates of the) parameters.
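The randomized multi-start idea behind gosolnp() can be sketched in base R with optim(), using a quadratic penalty to enforce the equality constraint approximately (a rough illustration, not the actual gosolnp algorithm; the penalty weight 100 and the 20 restarts are arbitrary choices):

```r
# Multi-start optimization sketch: random restarts in the box [1,10]^3,
# with the equality constraint enforced approximately via a quadratic penalty
set.seed(1)
fn_pen <- function(x) {
  -x[1]*x[2]*x[3] + 100*(4*x[1]*x[2] + 2*x[2]*x[3] + 2*x[3]*x[1] - 100)^2
}
best <- NULL
for (i in 1:20) {
  x0 <- runif(3, 1, 10)   # random starting point inside the box
  s  <- optim(x0, fn_pen, method = "L-BFGS-B", lower = rep(1, 3),
              upper = rep(10, 3), control = list(maxit = 500))
  if (is.null(best) || s$value < best$value) best <- s  # keep the best restart
}
round(best$par, 3)   # should land near solnp's solution (2.887, 2.887, 5.774)
```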

4.1.5 Example 4: Another linear example

Let’s try another minimization of a linear objective function \(f(x, y, z) = 4y-2z\) subject to

\[\begin{array}{rcl} 2x-y-z=2\\ x^2+y^2=1. \end{array}\]

fn4 <- function(x)  # f(x, y, z) = 4y-2z
{
  4*x[2] - 2*x[3]
}

# constraint z1: 2x-y-z  = 2 
# constraint z2: x^2+y^2 = 1
eqn4 <- function(x){ 
  z1=2*x[1] - x[2] - x[3]
  z2=x[1]^2 + x[2]^2
  
  return(c(z1, z2))
}
constraints4 <- c(2, 1)

x0 <- c(1, 1, 1)
ctrl <- list(trace=0)
sol4 <- solnp(x0, fun = fn4, eqfun = eqn4, eqB = constraints4, control=ctrl)
sol4$values
## [1]   2.000000  -5.078795 -11.416448  -5.764047  -3.584894  -3.224531
## [7]  -3.211165  -3.211103  -3.211103
sol4$pars
## [1]  0.55470019 -0.83205030 -0.05854932

The linear algebra and matrix computing Chapter 4 and the regularized parameter estimation in Chapter 17 provide additional examples of least squares parameter estimation, regression and regularization.

5 Manual vs. Automated Lagrange Multiplier Optimization

Let’s manually implement the Lagrange Multipliers procedure and then compare the results to some optimization examples obtained by automatic R function calls. The latter strategies may be more reliable, efficient, flexible, and rigorously validated. The manual implementation provides a more direct and explicit representation of the actual optimization strategy.

We will test a simple example of an objective function: \[f(x, y, z) = 4y-2z + x^2+y^2, \] subject to two constraints:

\[\begin{array}{rcl} 2x-y-z = 2 \\ x^2+y^2 +z = 1. \end{array}\]

The R package numDeriv may be used to calculate numerical approximations of partial derivatives.

# define the main Lagrange Multipliers Optimization strategy from scratch
require(numDeriv)
## Loading required package: numDeriv
lagrange_multipliers <- function(x, f, g) {  # Objective/cost function, f, and constraints, g
    k <- length(x)
    l <- length(g(x))
  
    # Compute the derivatives
    grad_f <- function(x) { grad(f, x) }
  
    # g, representing multiple constraints, is a vector-valued function: its first derivative is a matrix
    grad_g <- function(x) {  jacobian(g, x) }
    
    # The Lagrangian is a scalar-valued function:
    #   L(x, lambda) = f(x) - lambda * g(x)
    # whose first derivative roots give the optimal solutions
    #   h(x, lambda) = c( f'(x) - lambda * g'(x), - g(x) ).
    h <- function(y) {
        c(grad_f(y[1:k]) - t(y[-(1:k)]) %*% grad_g(y[1:k]), -g(y[1:k]))
    }
    
    # To find the roots of the first derivative, we can use Newton's method:
    # iterate  y <- y - h'(y)^{-1} h(y)  until certain convergence criterion is met
    # e.g., (\delta <= 1e-6)
    grad_h <- function(y) { jacobian(h, y) }
    
    y <- c(x, rep(0, l))
    previous <- y + 1
    while(sum(abs(y-previous)) > 1e-6) {
        previous <- y
        y <- y - solve( grad_h(y), h(y) )
    }
    y[1:k]
}

x <- c(0, 0, 0)

# Define the objective cost function
fn4 <- function(x)  # f(x, y, z) = 4y-2z + x^2+y^2
{
  4*x[2] - 2*x[3] + x[1]^2+ x[2]^2
  #sum(x^2)
}
# check the derivative of the objective function
grad(fn4, x)
## [1]  0  4 -2
# define the domain constraints of the problem
# constraint z1: 2x-y-z  = 2 
# constraint z2: x^2+y^2 +z = 1
eqn4 <- function(x){ 
  z1=2*x[1] - x[2] - x[3] -2
  z2=x[1]^2 + x[2]^2 + x[3] -1
  return(c(z1, z2))
}

# Check the Jacobian of the constraints
jacobian(eqn4, x)
##      [,1] [,2] [,3]
## [1,]    2   -1   -1
## [2,]    0    0    1
# Call the Lagrange-multipliers solver

# check one step of the algorithm
k <- length(x)
l <- length(eqn4(x)); 
h <- function(x) {
    c(grad(fn4, x[1:k]) - t(-x[(1:2)]) %*% jacobian(eqn4, x[1:k]), -eqn4(x[1:k]))
}
jacobian(h, x)
##      [,1] [,2]          [,3]
## [1,]    4    0  0.000000e+00
## [2,]   -1    2  5.482583e-15
## [3,]   -1    1  0.000000e+00
## [4,]   -2    1  1.000000e+00
## [5,]    0    0 -1.000000e+00
# Lagrange-multipliers solver for f(x, y, z) subject to g(x, y, z)
lagrange_multipliers(x, fn4, eqn4)
## [1]  0.3416408 -1.0652476 -0.2514708
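As a quick, self-contained sanity check (re-stating the constraint function from above), the returned point should satisfy both equality constraints up to the solver’s tolerance:

```r
# Verify the manual Lagrange-multiplier solution satisfies both constraints
eqn4 <- function(x){
  z1 = 2*x[1] - x[2] - x[3] - 2
  z2 = x[1]^2 + x[2]^2 + x[3] - 1
  return(c(z1, z2))
}
xstar <- c(0.3416408, -1.0652476, -0.2514708)
eqn4(xstar)   # both residuals ~ 0
```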

Now, let’s double-check the above manual optimization results against the automatic solnp solution minimizing \[f(x, y, z) = 4y-2z + x^2+y^2\] subject to:

\[\begin{array}{rcl} 2x-y-z=2\\ x^2+y^2+z=1. \end{array}\]

library(Rsolnp)
fn4 <- function(x)  # f(x, y, z) = 4y-2z + x^2+y^2
{
  4*x[2] - 2*x[3] + x[1]^2+ x[2]^2
}

# constraint z1: 2x-y-z  = 2 
# constraint z2: x^2+y^2 +z = 1
eqn4 <- function(x){ 
  z1=2*x[1] - x[2] - x[3]
  z2=x[1]^2 + x[2]^2 + x[3]
  return(c(z1, z2))
}
constraints4 <- c(2, 1)

x0 <- c(1, 1, 1)
ctrl <- list(trace=0)
sol4 <- solnp(x0, fun = fn4, eqfun = eqn4, eqB = constraints4, control=ctrl)
sol4$values
## [1]  4.0000000 -0.1146266 -5.9308852 -3.7035124 -2.5810141 -2.5069444
## [7] -2.5065779 -2.5065778 -2.5065778

The results of both (manual and automated) experiments identifying the optimal \((x, y, z)\) coordinates minimizing the objective function \(f(x, y, z) = 4y-2z + x^2+y^2\) are in agreement.

  • Manual optimization: lagrange_multipliers(x, fn4, eqn4): 0.3416408 -1.0652476 -0.2514708.
  • Automated optimization: solnp(x0, fun = fn4, eqfun = eqn4, eqB = constraints4, control=ctrl): 0.3416408 -1.0652476 -0.2514709.

6 Data Denoising

Suppose we are given \(x_{noisy}\) with \(n\) noise-corrupted data points. The noise may be additive (\(x_{noisy}\sim x +\epsilon\)) or not. We may be interested in denoising the signal and recovering a version of the original (unobserved) dataset \(x\), potentially as a smoother representation of the original (uncorrupted) process. Smoother signals exhibit fewer (random) fluctuations between neighboring data points.

One objective function we can design to denoise the observed signal, \(x_{noisy}\), may include a fidelity term and a regularization term, see the regularized linear modeling, Chapter 17.

Total variation denoising assumes that for each time point \(t\), the observed noisy data \[\underbrace{x_{noisy}(t)}_{\text{observed signal}} \sim \underbrace{x(t)}_{\text{native signal}} + \underbrace{\epsilon(t)}_{\text{random noise}}.\]

To recover the native signal, \(x(t)\), we can optimize (\(\arg\min_x{f(x)}\)) the following objective cost function:

\[f(x) = \underbrace{\frac{1}{2} \sum_{t=1}^{n} {| x(t) - x_{noisy}(t)| ^2}}_{\text{fidelity term}} + \underbrace{\lambda \sum_{t=2}^{n} | x(t) - x(t-1)|}_{\text{regularization term}}, \]

where \(\lambda\) is the regularization smoothness parameter, \(\lambda \rightarrow 0 \implies x \rightarrow x_{noisy}\). Minimizing \(f(x)\) provides a minimum total-variation solution to the data denoising problem.
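A direct discrete implementation of this objective, together with a quick check that smoothing reduces the total-variation penalty, can be sketched in base R (a standalone illustration with a simulated signal; the 5-point running mean is an arbitrary smoother choice):

```r
# Discrete TV-denoising objective: fidelity + lambda * total variation
tv_objective <- function(x, x_noisy, lambda) {
  0.5*sum((x - x_noisy)^2) + lambda*sum(abs(diff(x)))
}
tv <- function(x) sum(abs(diff(x)))  # total-variation penalty alone

set.seed(1)
noisy  <- sin(seq(0, 2*pi, len = 200)) + rnorm(200, 0, 0.3)
smooth <- as.numeric(stats::filter(noisy, rep(1/5, 5), sides = 2))
smooth <- smooth[!is.na(smooth)]     # drop the NA edge values of the running mean

c(tv_noisy = tv(noisy), tv_smooth = tv(smooth))  # smoothing reduces total variation
```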

Below is an example illustrating total variation (TV) denoising using a simulated noisy dataset. We start by generating an oscillatory noisy signal. Then, we compute several smoothed versions of the noisy data, plot the initial and smoothed signals, define and optimize the TV denoising objective function, which is a mixture of a fidelity term and a regularization term.

n  <- 1000
x <- rep(0, n)
xs <- seq(0, 8, len=n)   # n equally-spaced points in [0, 8]
noise_level = 0.3  # sigma of the noise, try varying this noise-level

# here is where we add the zero-mean noise
set.seed(1234)

x_noisy <- function (x) { 
  sin(x)^2/(1.5+cos(x)) + rnorm(length(x), 0, noise_level)
}

# initialize the manual denoised signal
x_denoisedManu <- rep(0, n)

df <- as.data.frame(cbind(xs, x_noisy(xs)))
# loess fit a polynomial surface determined by numerical predictors,
# using local fitting
poly_model1 <- loess(x_noisy(xs) ~ xs, span=0.1, data=df) # tight model
poly_model2 <- loess(x_noisy(xs) ~ xs, span=0.9, data=df) # smoother model
# To see some of the numerical results of the model-fitting:
# View(as.data.frame(cbind(xs, x_noisy, predict(poly_model1))))

plot(xs, x_noisy(xs), type='l')
lines(xs, poly_model1$fitted, col="red", lwd=2)
lines(xs, poly_model2$fitted, col="blue", lwd=3)

Next, let’s initiate the parameters, define the objective function and optimize it, i.e., estimate the parameters that minimize the cost function as a mixture of fidelity and regularization terms.

# initialization of parameters
betas_0 <- c(0.3, 0.3, 0.5, 1)
betas <- betas_0

# Denoised model
x_denoised <- function(x, betas) {
  if (length(betas) != 4) {
    stop(paste0("Error!!! length(betas)=", length(betas), " != 4!!! Exiting ..."))
  }  
  # print(paste0("   .... betas = ", betas, "...\n"))
  # original noise function definition: sin(x)^2/(1.5+cos(x))
  return((betas[1]*sin(betas[2]*x)^2)/(betas[3]+cos(x)))
} 

library(Rsolnp)

fidelity <- function(x, y) {
  sqrt((1/length(x)) * sum((y - x)^2))
}

regularizer <- function(betas) {
  reg <- 0
  for (i in 1:(length(betas)-1)) {  # note: length(betas)-1, excluding the lambda parameter betas[4]
      reg <- reg + abs(betas[i])
  }
  return(reg)
}

# Objective Function
objective_func <- function(betas)  {
# f(betas) = fidelity(x_noisy, x_denoised(betas)) + lambda * sum_i |beta_i|
  fid <- fidelity(x_noisy(xs), x_denoised(xs, betas)) 
  reg <- abs(betas[4])*regularizer(betas)
  error <- fid + reg
  # uncomment to track the iterative optimization state
  # print(paste0(".... Fidelity =", fid, " ... Regularizer = ", reg, " ... TotalError=", error))
  # print(paste0("....betas=(",betas[1],",",betas[2],",",betas[3],",",betas[4],")"))
  return(error)
}

# inequality constraint forcing the regularization parameter lambda=beta[4]>0
inequalConstr <- function(betas){
   betas[4]
}
inequalLowerBound <- 0; inequalUpperBound <- 100

# should we list the value of the objective function and the parameters at every iteration (default trace=1; trace=0 means no interim reports)
# constrained problem 
ctrl <- list(trace=0, tol=1e-5)  # could specify: outer.iter=5, inner.iter=9
set.seed(121)
sol_lambda <- solnp(betas_0, fun = objective_func, ineqfun = inequalConstr, ineqLB = inequalLowerBound, ineqUB = inequalUpperBound, control=ctrl)

# unconstrained optimization alternative:
# ctrl <- list(trace=1, tol=1e-5)
# sol_lambda <- solnp(betas_0, fun = objective_func, control=ctrl)

# suppress the report of the functional values (too many)
# sol_lambda$values

# report the optimal parameter estimates (betas)
sol_lambda$pars
## [1] 2.5649689 0.9829681 1.7605481 0.9895268
# Reconstruct the manually-denoised signal using the optimal betas
betas <- sol_lambda$pars
x_denoisedManu <- x_denoised(xs, betas)
print(paste0("Final Denoised Model:", betas[1], "*sin(", betas[2], "*x)^2/(", betas[3], "+cos(x)))"))
## [1] "Final Denoised Model:2.56496893433154*sin(0.982968123322892*x)^2/(1.76054814253387+cos(x)))"
plot(x_denoisedManu)

Finally, we can validate our manual denoising protocol against the automated TV denoising using the R package tvd.

# install.packages("tvd")
library("tvd")
## Warning: package 'tvd' was built under R version 3.5.1
lambda_0 <- 0.5
x_denoisedTVD <- tvd1d(x_noisy(xs), lambda_0, method = "Condat")
# lambda_o is the total variation penalty coefficient
# method is a string indicating the algorithm to use for denoising. 
# Default method is "Condat"

# plot(xs, x_denoisedTVD, type='l')
plot(xs, x_noisy(xs), type='l')
lines(xs, poly_model1$fitted, col="red", lwd=2)
lines(xs, poly_model2$fitted, col="blue", lwd=3)
lines(xs, x_denoisedManu, col="pink", lwd=4)
lines(xs, x_denoisedTVD, col="green", lwd=5)
# add a legend
legend("bottom", c("x_noisy", "poly_model1$fitted", "poly_model2$fitted", "x_denoisedManu", "x_denoisedTVD"), col=c("black", "red", "blue", "pink", "green"), lty=c(1,1, 1,1), cex=0.7, lwd= c(1,2,3,4,5), title="TV Denoising")

7 Sparse Matrices

In Chapter 4 we saw matrix computing. However, we need a new data structure to efficiently store big but sparse matrices, whose elements are mostly trivial (zero). Dense matrices have mostly non-zero elements, whereas sparse matrices consist mostly of zeros. The sparsity of a matrix is the proportion of zero elements. Even if the original data are not sparse, some data preprocessing may result in sparse data structures. Various techniques to store and process sparse matrices are included in the R package Matrix.

Here is an example of a large \(15000\times 15000\) but sparse diagonal matrix (\(SM\)) compared to a standard diagonal matrix (\(D\)) of the same size:

# install.packages("plyr")
library("Matrix")
memory.limit(50000) # increase the RAM allocation (Windows-only)
## [1] 50000
n = 15000
SM = sparseMatrix(1:n, 1:n, x = 1)
D = diag(1, n, n)

# Check the sizes of the SM and D:
object.size(SM); object.size(D)
## 241504 bytes
## 1800000216 bytes
# Try to invert each of the matrices to see that the SM arithmetic is much more efficient
solve(SM)[1:10, 1:10] # skip this is very intensive: solve(D)[1:10, 1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
##                          
##  [1,] 1 . . . . . . . . .
##  [2,] . 1 . . . . . . . .
##  [3,] . . 1 . . . . . . .
##  [4,] . . . 1 . . . . . .
##  [5,] . . . . 1 . . . . .
##  [6,] . . . . . 1 . . . .
##  [7,] . . . . . . 1 . . .
##  [8,] . . . . . . . 1 . .
##  [9,] . . . . . . . . 1 .
## [10,] . . . . . . . . . 1

The size of the dense matrix \(D\) is over \(7,000\) times larger than that of \(SM\), despite the fact that both represent \(n\times n\) matrices. The difference is that sparse-matrix formats store only the non-zero entries, together with their row/column index information (e.g., the compressed sparse column format used by class dgCMatrix), instead of all \(n^2\) elements.
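For completeness, here is a small sketch of building a sparse matrix directly from (row, column, value) triplets with Matrix::sparseMatrix() (the entries are illustrative values):

```r
library(Matrix)
# Triplet representation: only the three non-zero entries are stored
M <- sparseMatrix(i = c(1, 3, 5), j = c(2, 3, 1), x = c(10, 20, 30), dims = c(5, 5))
M            # printed with "." marking structural zeros
sum(M != 0)  # 3 non-zero entries out of 25
```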

8 Parallel Computing

Parallel computing is useful for many processes where the algorithm can be partitioned into smaller jobs that can be run in parallel on different compute cores and results integrated at the end, e.g., Monte-Carlo simulations, machine learning tasks, separable transformations. The user creates a virtual cluster of compute nodes where each core runs independently and the resulting information pieces are aggregated at the end.

Let’s try a simple calculation of the mean:

n = 200000
sapply(1:n, mean)[1:20]
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

Let’s create a private compute cluster using 4 of the available 8 cores and rerun the same script using parallel processing via the R parallel package.

library("parallel")

# check the number of available cores:
detectCores()
## [1] 8
# construct a cluster of 4 cores
clust = makeCluster(4)

# swaping sapply() with parSapply():
parSapply(clust, 1:n, mean)[1:20]
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
stopCluster(clust)

Once done, we should always stop the clusters and release the system resources to avoid memory leaks.

Most computers include multiple processors, which can be utilized to speed up calculations. Avoid requesting all available cores, as this may make the machine unresponsive, and do not expect a performance improvement directly proportional to the number of cores used, as there is multi-thread communication overhead.
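When the parallel jobs involve random number generation (e.g., Monte-Carlo simulation), each worker needs its own reproducible random-number stream; a minimal sketch using parallel::clusterSetRNGStream() (the 2-core cluster and the seed are arbitrary choices):

```r
library(parallel)
cl <- makeCluster(2)                   # small 2-core cluster
clusterSetRNGStream(cl, iseed = 1234)  # reproducible parallel RNG streams
res <- parSapply(cl, 1:4, function(i) mean(runif(1e4)))
stopCluster(cl)                        # always release the cluster resources
round(res, 2)                          # each sample mean should be ~0.5
```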

9 Foundational Methods for Function Optimization

Function optimization may not always be possible in closed form, i.e., finding exact minima or maxima of functions is not always analytically tractable. However, there are iterative algorithms that compute approximate solutions to such function optimization problems.

9.1 Basics

Let’s recall the following Newtonian principles for a given objective (cost) function \(f: \Re^n \to \Re\), which we want to optimize, i.e., we are looking for solution(s) \(x^*= \arg\min_{x} \; f(x)\).

  • Maximizing \(f\) is the same as minimizing \(-f\), as \(\arg\max_{x} \; f(x) = \arg\min_{x} \;-f(x)\).

  • If \(\psi\) is strictly increasing (e.g., \(\log(x)\), \(x^3\), \(e^x\)), then \(\arg\min_{x} \; f(x) = \arg\min_{x}\; \psi(f(x))\).

  • The accuracy (bias) measures how close we get to the optimum point \(x^*\).

  • The convergence speed reflects how quickly (in terms of the number of iterations) we get towards \(x^*\).

  • The computational complexity captures how expensive is it to perform a single iteration.

  • For 1D functions (\(n=1\)), if \(f\) is smooth, then \(x^\ast\) is a local optimum implies that \(f'(x^\ast) = 0\). The sign of the second derivative determines if the optimum is minimum (\(f''(x^\ast)>0\)) or maximum (\(f''(x^*)<0\)).

  • In \(\Re^n\), for smooth \(f: \Re^n \to \Re\), then at (local) optimal points, \(\nabla f(x^\ast) = 0\), where the gradient of \(f\) is the vector of all partial derivatives:\(\nabla f(x) =\left(\begin{array}{c} \frac{\partial f(x)}{\partial x_1} \\ \frac{\partial f(x)}{\partial x_2} \\ \vdots \\ \frac{\partial f(x)}{\partial x_n} \end{array}\right)\).

  • In higher dimensions, the second derivative is represented by the Hessian matrix, \(\nabla^2 f\), defined at each point \(x\) by \([\nabla^2 f(x)]_{ij} = \frac{\partial^2 f(x)}{\partial x_i \partial x_j}\). At a (strict) local minimum \(x^\ast\), the Hessian is positive definite (\(v^T \nabla^2 f(x^\ast) v > 0\) for all \(v\neq 0\)); at a (strict) local maximum, it is negative definite (\(v^T \nabla^2 f(x^\ast) v < 0\) for all \(v\neq 0\)).
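These first- and second-order conditions can be verified numerically with the numDeriv package (used again later in this chapter), e.g., for the quadratic \(f(x_1,x_2)=(x_1-3)^2+(x_2+4)^2\) with minimum at \((3,-4)\):

```r
library(numDeriv)
f <- function(x) (x[1] - 3)^2 + (x[2] + 4)^2
xstar <- c(3, -4)
grad(f, xstar)                    # ~ c(0, 0): the first-order condition holds
eigen(hessian(f, xstar))$values   # both positive: positive-definite Hessian, a minimum
```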

9.2 Gradient Descent

Let’s recall the relation between function changes (rise), relative to changes in the argument (run), gradients, and derivatives.

The derivative of a function is defined as the limit of the rise over the run, \[ f'(x_0) = \lim_{x\rightarrow x_0}{\frac{f(x)-f(x_0)}{x-x_0}}.\]

Thus, when \(x\) is close to \(x_0\), a first-order (linear) approximation of the function at \(x\) is obtained by \(f(x) \approx f(x_0) + f'(x_0)(x-x_0)\). This linear approximation allows us to approximate (locally) the function extrema by moving down the slope, or gradient, \(f'(x_0)\). The higher-dimensional analogue is similar, for \(x \sim x_0 \in \Re^n\), \(f(x) \approx f(x_0) + \nabla f(x_0)^T (x-x_0)\). Effectively, all smooth functions look linear in a small neighborhood around each domain point (including the extrema points). In addition, the gradient \(\nabla f(x_0)\) points in the direction of fastest ascent. Therefore, to optimize \(f\), we move in direction given by \(-\nabla f(x_0)\). Recall that the inner product of two vectors, \(\langle a,b \rangle\), is defined by \(\langle a,b \rangle = a^T b = \sum_{i=1}^n a_i b_i\).

9.2.1 Pseudo algorithm

Here is a simple yet illustrative algorithm for minimizing \(f\) by gradient descent, iteratively moving in the direction of the negative gradient.

  1. (Initialization): Start with initial guess \(x_{(0)}\), a step size \(\eta\) (aka learning rate), and a threshold (tolerance level),
  2. (Iterative Update) For \(k=1,2,3,\ldots\):
    • Compute the gradient \(\nabla f(x_{(k-1)})\)
    • If gradient is close to zero (stopping criterion threshold), then stop, otherwise continue
    • Update \(x_{(k)} = x_{(k-1)} - \eta \nabla f(x_{(k-1)})\),
  3. Return the final \(x_{(k)}\) as the approximate solution \(x^*\).

9.2.2 Example

Let’s demonstrate a simple example performing gradient descent optimization.

# install.packages("numDeriv")
library(numDeriv)         # to access the gradient calculation function: grad()

gradDescent = function(f, x0, max.iter=500, stepSize=0.01, stoppingCriterion=0.001) {
  n = length(x0)
  xmat = matrix(0, nrow=n, ncol=max.iter)
  xmat[,1] = x0
  
  for (k in 2:max.iter) {
    # Calculate the current gradient
    currentGrad = grad(f, xmat[,k-1]) 
    
    # Check Stopping criterion
    if (all(abs(currentGrad) < stoppingCriterion)) {
      k = k-1; break
    }
    
    # Move in the opposite direction of the gradient
    xmat[,k] = xmat[,k-1] - stepSize * currentGrad
  }

  xmat = xmat[,1:k]     # keep only the k computed iterates (drop unused columns)
  # return the final solution, the solution path, min.value, and the number of iterations
  return(list(x=xmat[,k], xmat=xmat, min.val=f(xmat[,k]), k=k))
}

Let’s try one of the earlier examples, \(\min f(x_1, x_2) = \min \{(x_1-3)^2 + (x_2+4)^2\}\), which we solved using optim():

# Define the function
f <- function(x) { (x[1] - 3)^2 + (x[2] +4)^2 }
initial_x <- c(0, -1)
x_optimal <- optim(initial_x, f, method="CG") # performs minimization
x_min <- x_optimal$par
# x_min contains the domain values where the (local) minimum is attained
x_min   # critical point/vector
## [1]  3 -4
x_optimal$value 
## [1] 8.450445e-15
# call the gradient descent optimizer
x0 = c(0, 0)
optim.f = gradDescent(f, x0)
# local min x
optim.f$x
## [1]  2.999626 -3.999502
# optimal value
optim.f$min.val
## [1] 3.882821e-07
# number of iterations
optim.f$k
## [1] 446

Let’s make a 3D plot to visualize the solution path.

surface = function(f, from.x=0, to.x=1, from.y=0, to.y=1, n.x=30, n.y=30, theta=5, phi=80, ...) {
  # Build the 2d grid
  x.seq = seq(from=from.x,to=to.x,length.out=n.x)
  y.seq = seq(from=from.y,to=to.y,length.out=n.y)
  plot.grid = expand.grid(x.seq,y.seq)
  z.vals = apply(plot.grid,1,f)
  z.mat = matrix(z.vals,nrow=n.x)
  
  # Plot with the persp function 
  orig.mar = par()$mar # Save the original margins
  par(mar=c(1,1,1,1))  # Make the margins small
  r = persp(x.seq,y.seq,z.mat,theta=theta,phi=phi,...)
  par(mar=orig.mar) # Restore the original margins
  invisible(r)
}

# Draw our simple function in 2d, and our gradient descent path on top
r = surface(f,from.x=-2,to.x=5,from.y=-6,to.y=5,theta=45,phi=-5,xlab="x1",ylab="x2",zlab="f")
points(trans3d(x0[1],x0[2],f(x0),r),col="red",cex=2)
lines(trans3d(optim.f$xmat[1,],optim.f$xmat[2,],apply(optim.f$xmat,2,f),r),lwd=4,col="red")
points(trans3d(optim.f$x[1],optim.f$x[2],f(optim.f$x),r), col="gray",cex=2)

9.2.3 Summary of Gradient Descent

There are pros and cons of using gradient descent optimization

Advantages of Gradient Descent:

  • Simple and intuitive
  • Easy to implement
  • Iterations are usually computationally efficient (involving estimation of the gradient)

Disadvantages:

  • The algorithm may be slow and may zig-zag (alternate direction) when the components of \(\nabla f(x)\) have very different magnitudes
  • It may require a long time to converge to the optimum
  • For some functions, getting close to the optimum (\(x_{(k)} \approx x^*\), with \(f(x_{(k)})-f(x^*) \leq \epsilon\)) may require a large number of iterations, \(k\approx 1/\epsilon\). However, for smoother (e.g., strongly convex) functions this may drop to \(k\approx \log(1/\epsilon)\) iterations.
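The alternating behavior can be demonstrated on an ill-conditioned quadratic, \(f(x)=x_1^2+25x_2^2\), whose two gradient components have very different scales (a standalone base-R sketch; the step size 0.035 is chosen to make the oscillation visible):

```r
# Fixed-step gradient descent on f(x) = x1^2 + 25*x2^2:
# the x2 coordinate overshoots and alternates sign at every step
gradf <- function(x) c(2*x[1], 50*x[2])
x <- c(5, 1); eta <- 0.035
path <- matrix(x, nrow = 1)
for (k in 1:30) {
  x <- x - eta * gradf(x)     # x2 is multiplied by (1 - 50*eta) = -0.75 each step
  path <- rbind(path, x)
}
path[1:6, 2]   # 1, -0.75, 0.5625, ... : alternating signs, slowly shrinking
```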

9.3 Convexity

There are substantial differences between convex function optimization over convex sets and non-convex function optimization over non-convex domains. The convexity assumptions make the optimization problem much easier than general non-convex optimization problems since (1) every local extremum of a convex problem is a global extremum, and (2) first-order conditions are sufficient for optimality.

Let’s reexamine our simple quadratic convex optimization problem (\(\min f(x_1, x_2) = \min \{(x_1-3)^2 + (x_2+4)^2\}\)). We can plot a number of gradient descent pathways starting from different initial conditions. If they all merge into each other (convergence), this would suggest robustness and reproducibility of the optimization.

x0 = c(0,0)
optim.f0 = gradDescent(f,x0)
x1 = c(-2,-2)
optim.f1 = gradDescent(f,x1)
x2 = c(2,1)
optim.f2 = gradDescent(f,x2)
x3 = c(3,-1)
optim.f3 = gradDescent(f,x3)
x4 = c(4,2)
optim.f4 = gradDescent(f,x4)

# Draw our simple function in 2d, and all gradient descent paths on top
r = surface(f,from.x=-2,to.x=5,from.y=-5,to.y=2, theta=45,phi=-5,xlab="x1",ylab="x2",zlab="f")
lines(trans3d(optim.f0$xmat[1,],optim.f0$xmat[2,],apply(optim.f0$xmat,2,f),r),lwd=4,col="red")
lines(trans3d(optim.f1$xmat[1,],optim.f1$xmat[2,],apply(optim.f1$xmat,2,f),r),lwd=4,col="green")
lines(trans3d(optim.f2$xmat[1,],optim.f2$xmat[2,],apply(optim.f2$xmat,2,f),r), lwd=4,col="blue")
lines(trans3d(optim.f3$xmat[1,],optim.f3$xmat[2,],apply(optim.f3$xmat,2,f),r), lwd=4,col="purple")
lines(trans3d(optim.f4$xmat[1,],optim.f4$xmat[2,],apply(optim.f4$xmat,2,f),r),lwd=4,col="orange")
points(trans3d(3,-4,f(c(3,-4)),r),col="red",cex=2, lwd=4, pch=0)  # optimum solution

So, all iterative solution pathways in this convex problem do lead to the optimal point. However, problems are not always convex, and this process can be significantly disrupted.

Let’s look at another example, \[\min_{x=(x_1,x_2)} f_1(x)=\left(\frac{1}{2}x_1^2-\frac{1}{4}x_2^2+3\right)\cos\left(2x_1+ 1- e^{x_2}\right).\]

f1 = function(x) {
  return((1/2*x[1]^2-1/4*x[2]^2+3)*cos(2*x[1]+1-exp(x[2])))
}

x0 = c(0.5,0.5)
optim.f0 = gradDescent(f1,x0,stepSize=0.01)
x1 = c(-0.1,-1.3)
optim.f1 = gradDescent(f1,x1,stepSize=0.01,max.iter=400)
x2 = c(-0.5,-1.3)
optim.f2 = gradDescent(f1,x2,stepSize=0.01,max.iter=400)
x3 = c(-0.2,1.4)
optim.f3 = gradDescent(f1,x3,stepSize=0.01,max.iter=400)
x4 = c(-0.5,-0.5)
optim.f4 = gradDescent(f1,x4,stepSize=0.01,max.iter=400)

# Show the f1 optimization space, and plot all gradient descent pathways
r = surface(f1,from.x=-2,to.x=2,from.y=-2,to.y=2,theta=-10,phi=70,xlab="x1",ylab="x2",zlab="f1")
lines(trans3d(optim.f0$xmat[1,],optim.f0$xmat[2,],apply(optim.f0$xmat,2,f1),r),lwd=4,col="red")
lines(trans3d(optim.f1$xmat[1,],optim.f1$xmat[2,],apply(optim.f1$xmat,2,f1),r),lwd=4,col="green")
lines(trans3d(optim.f2$xmat[1,],optim.f2$xmat[2,],apply(optim.f2$xmat,2,f1),r),lwd=4,col="blue")
lines(trans3d(optim.f3$xmat[1,],optim.f3$xmat[2,],apply(optim.f3$xmat,2,f1),r),lwd=4,col="purple")
lines(trans3d(optim.f4$xmat[1,],optim.f4$xmat[2,],apply(optim.f4$xmat,2,f1),r),lwd=4,col="orange")

Clearly, the non-convex problem has divergent solution pathways that depend on the initial conditions and the topology of the space. Convexity represents the fundamental difference between these two optimization problems (\(f(x)\) and \(f_1(x)\)).

If \(f(x)\) is a convex real-valued function defined on a convex domain \(D\), \(f:D \to \Re\), then \(\forall x_1, x_2 \in D\) and \(\forall t \in [0, 1]\): \[f(tx_1+(1-t)x_2)\leq t f(x_1)+(1-t)f(x_2).\] The convex optimization problem is to find a point \(x^\ast \in D\) for which \(f(x)\) is optimized, i.e., \(f(x^\ast) \le f(x), \forall x \in D\).
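A quick numeric spot-check of this defining inequality for the convex function \(f(x)=x^2\) (an illustration only; checking a finite grid of \(t\) values does not prove convexity):

```r
# Spot-check the convexity inequality for f(x) = x^2 on the segment [-2, 3]
f <- function(x) x^2
x1 <- -2; x2 <- 3
t  <- seq(0, 1, by = 0.1)
# The chord t*f(x1) + (1-t)*f(x2) lies above the function along the segment
all(f(t*x1 + (1-t)*x2) <= t*f(x1) + (1-t)*f(x2))   # TRUE
```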

In finite-dimensional normed spaces, the Hahn-Banach theorem provides theoretical necessary and sufficient conditions for optimality. Duality theory generalizes the classical linear programming problem to more complex situations and provides effective computational methods.

9.3.1 Notes

  • All extrema of convex functions are global, i.e., there are no local extrema. Optimization of convex problems is of polynomial complexity.
  • Gradient descent optimization provides solutions to most smooth convex functions, subject to selecting appropriate numeric parameters (e.g., the step-size, i.e., learning rate).
  • Gradient descent may often fail for non-convex functions, where moving downhill locally may not always lead to the global optimum.
  • Non-convex optimization is an NP-hard problem and is, in general, much harder to solve.

9.4 Newton’s method

As a second order method, Newton’s optimization is a bit more sophisticated than gradient descent, and may also be more accurate. Let’s recall the second order Taylor expansion of functions.

\[f(x) \approx f(x_0) + f'(x_0)(x-x_0) + \frac{1}{2}f''(x_0)(x-x_0)^2. \] This represents a more accurate approximation of \(f(x)\) near \(x_0\) than the first-order (linear) expansion. It directly generalizes to the multivariate case, \(f(x): D\subset\Re^n \to \Re\): \[f(x) \approx f(x_0) + \nabla f(x_0)^T(x-x_0) + \frac{1}{2}(x-x_0)^T \nabla^2 f(x_0) (x-x_0).\] The basic idea of Newton’s method is to repeatedly form the second-order Taylor representation of the objective function at the current point, minimize this quadratic approximation to obtain the next point, and iterate.
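As a quick numerical illustration (using the hypothetical test function \(f(x)=e^x\) expanded at \(x_0=0\)), the second-order Taylor polynomial is far more accurate near \(x_0\) than away from it:

```r
# Compare f(x) = exp(x) with its 2nd-order Taylor polynomial at x0 = 0:
# T2(x) = f(0) + f'(0) x + (1/2) f''(0) x^2 = 1 + x + x^2/2
f  <- exp
T2 <- function(x) 1 + x + x^2/2

err_far  <- abs(f(0.5) - T2(0.5))     # error away from x0
err_near <- abs(f(0.05) - T2(0.05))   # error near x0, orders of magnitude smaller
c(err_near, err_far)
```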

9.4.1 Newton’s method pseudo code

Repeat the formulation and minimization of the quadratic approximation to \(f\).

  1. Start with initial solution guess \(x_{(0)}\), a step-size \(\eta\) (learning step), and a tolerance level (\(\epsilon\)),
  2. Iterate for \(k=1,2,3,\ldots\):
    • Compute the gradient \(g=\nabla f(x_{(k-1)})\), Hessian \(H=\nabla^2 f(x_{(k-1)})\)
    • If the gradient is close to zero (\(\|g\|<\epsilon\)), then stop; otherwise continue
    • Update the solution to \(x_{(k)} = x_{(k-1)} - \eta H^{-1} g\)
  3. Return final \(x_{(k)}\) as approximate solution \(x^*\).
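A minimal R sketch of this pseudo code, applied to the hypothetical quadratic test function \(f(x) = (x_1-3)^2 + 2(x_2+1)^2\), follows below. Since the function is quadratic, a full Newton step with \(\eta=1\) lands on the minimizer immediately.

```r
# Minimal Newton's method sketch following the pseudo code above
newton <- function(grad, hess, x0, eta = 1, eps = 1e-8, max.iter = 100) {
  x <- x0
  for (k in 1:max.iter) {
    g <- grad(x)
    if (sqrt(sum(g^2)) < eps) break      # stop when the gradient (nearly) vanishes
    x <- x - eta * solve(hess(x), g)     # x_k = x_{k-1} - eta * H^{-1} g
  }
  list(par = x, iter = k)
}

# Quadratic test function f(x) = (x1-3)^2 + 2*(x2+1)^2, with minimum at (3, -1)
grad <- function(x) c(2*(x[1] - 3), 4*(x[2] + 1))
hess <- function(x) diag(c(2, 4))

res <- newton(grad, hess, x0 = c(0, 0))
res$par    # (3, -1), found exactly
res$iter   # 2: one Newton step plus the convergence check
```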

9.4.2 Advantages and disadvantages of the Newton’s method

Advantages:

  • Converges to a solution faster than gradient descent, because the Hessian and gradient together point in a more reliable search direction than the gradient alone
  • For nice functions, the expected number of iterations to get \(f(x_{(k)})-f(x^*) \leq \epsilon\) is \(k\approx \log\log(1/\epsilon)\)
  • Typically, it requires fewer iterations than gradient descent
  • For quadratic functions, it converges in just one step

Disadvantages:

  • In principle, each Newton iteration is much more expensive than a single gradient descent update, because it requires computing and inverting the Hessian
  • If the Hessian isn’t invertible (or is close to singular), the algorithm may fail.

9.5 Stochastic Gradient Descent

Recall from Chapter 4, that to estimate the linear regression coefficients on \(p\) variables, we minimize the fidelity term:

\[ f(\beta) = \frac{1}{n} \sum_{i=1}^n (y_i - x_i^T \beta)^2. \] For observed responses \(y_i\) and predictors \(x_i\), \(i=1,\ldots n\), the fidelity represents the standard least squares loss. When \(n\) and \(p\) are reasonable in size, we can solve this directly using linear algebra:

\[\beta^* = (X^T X)^{-1} X^T y,\] where \(X_{n\times p}\) is the (design) predictor matrix (with \(i\)th row \(x_i\)), and \(y_{n\times 1}\) is the response vector (with \(i\)th component \(y_i\)).
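As a sanity check, the closed-form normal-equations solution can be compared against R's built-in lm() fit; the simulated data and coefficients below are illustrative:

```r
# Closed-form least squares via the normal equations, checked against lm()
set.seed(123)
n <- 200; p <- 3
X <- matrix(rnorm(n*p), n, p)
beta_true <- c(2, -1, 0.5)
y <- as.vector(X %*% beta_true + rnorm(n, sd = 0.1))

beta_star <- solve(t(X) %*% X, t(X) %*% y)   # beta* = (X^T X)^{-1} X^T y
beta_lm   <- coef(lm(y ~ X - 1))             # same fit, no intercept term

max(abs(beta_star - beta_lm))                # agreement to numerical precision
```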

For extreme sample or parameter sizes, e.g., when \(n=10^k\) for large \(k\), it may be difficult to compute \(X^T X\), which will impede the computational solution. Gradient descent may work, but may also be expensive, because the gradient \(\nabla f(\beta) = \frac{1}{n} \sum_{i=1}^n x_i(x_i^T \beta - y_i)\) involves a sum of \(n\) terms.

This problem led to the development of stochastic methods that overcome the challenge of large-scale statistical optimization. The idea is to simplify the calculations so that each time we need to compute a complex gradient or Hessian, we can approximate it by a simpler analogue obtained by random sampling.

For instance, instead of updating the solution using \[\beta - t \nabla f(\beta) = \beta - \frac{t}{n} \sum_{i=1}^n x_i(x_i^T \beta - y_i), \]

which may be prohibitively expensive, we can update the solution by \[\beta - t g = \beta - \frac{t}{m} \sum_{i \in I} x_i (x_i^T \beta - y_i).\]

The range of the sum above is reduced to \(I\subset\{1, 2, ..., n\}\), a random subset of \(\{1,\ldots n\}\) of size \(m\ll n\). In other words, we’ve replaced the full gradient by a stochastic gradient estimate: \[g = \frac{1}{m} \sum_{i \in I} x_i(x_i^T \beta - y_i).\] Note that the latter is an unbiased estimate of the former and is computed over just \(m\) data points, which is much more computationally efficient when \(m\) is relatively small. In general, stochastic gradient descent makes rapid initial progress in minimizing the objective function, but it may take longer to reach highly accurate solutions at the end.
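The unbiasedness claim is easy to check empirically: averaging many minibatch gradient estimates (for simulated data and an arbitrary fixed \(\beta\), both illustrative choices) recovers the full gradient:

```r
# The minibatch gradient g is an unbiased estimate of the full gradient:
# averaging many random-subset gradients recovers grad f(beta)
set.seed(42)
n <- 500; p <- 2; m <- 20
X <- matrix(rnorm(n*p), n, p)
y <- as.vector(X %*% c(1, -2) + rnorm(n, sd = 0.1))
beta <- c(0, 0)                                  # arbitrary evaluation point

full_grad <- as.vector((1/n) * t(X) %*% (X %*% beta - y))
mb_grads  <- replicate(5000, {
  I <- sample(n, m)                              # random minibatch of size m
  as.vector((1/m) * t(X[I, , drop = FALSE]) %*% (X[I, , drop = FALSE] %*% beta - y[I]))
})
rowMeans(mb_grads)   # close to full_grad, up to Monte Carlo error
```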

9.5.1 Stochastic Gradient Descent: Pseudo code

Let’s demonstrate the pseudo code implementation of stochastic gradient descent for high-dimensional data with a large number of predictors, \(p\). For both statistical and computational reasons, a complete search over all \(p\) regression parameters may be impractical. In a statistical sense, a full search may yield bad estimates (e.g., with high variance), and in a computational sense, the algorithm may be very expensive and slow to converge. Therefore, when we have a large number of features (\(p\gg 10^m\)), it’s likely that most of the predictors are less important and only a few are salient features. We can ignore the small effects by shrinking them to zero, just like we did in Chapter 17 for regularized linear modeling. Below is an example of a sparse stochastic gradient descent pseudo code:

  1. Start with initial guess \(\beta^{(0)}\), step-size \(\eta\), and tolerance level \(\epsilon\)
  2. For \(k=1,2,3,\ldots\):
    • Randomly shuffle indices of the cases and select the reduced index set, \(I\subset\{1, 2, ..., n\}\)
    • Approximate the true gradient \(\nabla f(\beta^{(k-1)})\) by \(g = \frac{1}{m} \sum_{i \in I} x_i(x_i^T \beta^{(k-1)} - y_i)\)
    • Check if the gradient estimate is close to zero; if so, stop, otherwise continue
    • Update \(\beta^{(k)} = \beta^{(k-1)} - \eta g\)
    • Correct \(\beta^{(k)}\) by thresholding the small components to zero
  3. Return final \(\beta^{(k)}\) as approximate solution \(\beta^*\).
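Here is a minimal R sketch of this minibatch SGD pseudo code on simulated least squares data; the sample size, step-size, minibatch size, and threshold below are illustrative choices:

```r
# Minibatch stochastic gradient descent for least squares (sketch)
set.seed(2024)
n <- 1000; p <- 3; m <- 32
X <- matrix(rnorm(n*p), n, p)
beta_true <- c(2, -1, 0.5)
y <- as.vector(X %*% beta_true + rnorm(n, sd = 0.1))

beta <- rep(0, p); eta <- 0.05
for (k in 1:2000) {
  I <- sample(n, m)                                 # random minibatch
  g <- (1/m) * t(X[I, , drop = FALSE]) %*% (X[I, , drop = FALSE] %*% beta - y[I])
  beta <- beta - eta * as.vector(g)                 # SGD update
  beta[abs(beta) < 1e-3] <- 0                       # shrink tiny effects to zero
}
round(beta, 2)   # close to beta_true = (2, -1, 0.5)
```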

Note that if all components of \(\beta\) get smaller at the same time that we shrink the small ones to zero, then this will guarantee the convergence of the algorithm and will yield the LASSO solution.

9.6 Simulated Annealing (SANN)

Simulated annealing is another approach for probabilistic approximation of the global optimum of an objective function over a large domain (search space). It is very effective for discrete search spaces and for problems where quick identification of an approximate global optimum is more important than finding an accurate local optimum. Annealing is a term used in metallurgy for the controlled repeated heating and cooling of materials to increase the size of their crystals and reduce their defects. In SANN, the simulation of annealing refers to approximating the global minimum of a cost function of many variables via a simulated walk through the solution space, with a slowly decreasing probability of accepting solutions that are worse than the current one.

9.6.1 Pseudocode

Below is a simulated annealing pseudocode that starts from a state \(s_o\) and continues until a maximum of \(k_{max}\) steps is completed. At each step, the neighbor(s) call generates a randomly chosen neighbor of a given state \(s\), and random(0, 1) returns a value drawn uniformly at random from \([0, 1]\). The annealing schedule is defined by temperature(r), which yields the temperature to use given the fraction \(r\) of the time budget that has been expended so far.

  1. Initialize the algorithm by setting \(s=s_o\)
  2. For \(k\in \{0, ..., k_{max}\}\):
    • \(T\longleftarrow \text{temperature}(k/k_{max})\)
    • Pick a random neighbor, \(s_{new} \longleftarrow neighbor(s)\)
    • If \(P(E(s), E(s_{new}), T) \ge random(0, 1)\), then \(s \longleftarrow s_{new}\)
  3. Output: the final state, \(s\).
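The pseudocode above can be sketched directly in R; below we minimize the same test function used in the optim() example that follows, with a simple linear cooling schedule, a Metropolis acceptance probability, and Gaussian neighbor proposals (all illustrative choices):

```r
# Minimal simulated annealing sketch following the pseudocode above,
# minimizing E(s) = (s1 - 3)^2 + (s2 + 4)^2 with true minimizer (3, -4)
set.seed(7)
E <- function(s) (s[1] - 3)^2 + (s[2] + 4)^2
temperature <- function(r) (1 - r) + 1e-6            # linear cooling schedule
neighbor    <- function(s) s + rnorm(2, sd = 0.5)    # random nearby state

s <- c(0, 0); s_best <- s
k_max <- 20000
for (k in 0:(k_max - 1)) {
  Tk    <- temperature(k / k_max)
  s_new <- neighbor(s)
  # Metropolis acceptance: always accept improvements; accept worse
  # moves with probability exp(-(E_new - E_old)/T)
  if (exp((E(s) - E(s_new)) / Tk) >= runif(1)) s <- s_new
  if (E(s) < E(s_best)) s_best <- s
}
round(s_best, 2)   # near the true minimizer (3, -4)
```

Tracking the best-visited state (s_best) is a common practical addition, since the walk may wander away from a good solution at high temperatures.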

Below is a SANN example to optimize \(f(x_1,x_2) = (x_1 - 3)^2 + (x_2 +4)^2\).

f <- function(x) { (x[1] - 3)^2 + (x[2] +4)^2 }
initial_x <- c(0, -1)
x_optimal <- optim(initial_x, f, method="SANN") # performs Simulated Annealing minimization
x_min <- x_optimal$par
# x_min contains the domain values where the (local) minimum is attained
x_min   # critical point/vector
## [1]  2.995954 -4.001052
x_optimal$value 
## [1] 1.747668e-05

10 Practice examples

10.1 Application: Healthcare Manufacturer Product Optimization

To improve the return on investment for their shareholders, a healthcare manufacturer needs to optimize their production line. The organization’s data-analytics team is tasked with determining the optimal production quantities of two products to maximize the company’s bottom line. The pair of core company products include:

  • A new non-invasive colonoscopy testkit (CTK) that retails at \(\$339\) per kit,
  • A novel statistical data obfuscation software tool (DataSifter) that enables the secure sharing of sensitive data, which costs \(\$399\) per year.

The healthcare company’s cost to make and support each of these products is \(\$195\) per CTK set and \(\$225\) per DataSifter license. Additional company operational fixed costs amount to \(\$400,000\) per year. In competitive market conditions, the number of sales of these healthcare products (a testkit and a software license) does affect the sale prices. Assume that for each product, the sales price drops by one cent (\(1¢\)) for each additional item sold. There is also an association between the sales of the CTK and DataSifter products. The company’s historical sales suggest that the CTK unit price is reduced by an additional \(0.3¢\) for each DataSifter license purchased. Similarly, the price for the DataSifter license decreases by \(0.4¢\) for each CTK sold. These product price fluctuations are due to partnerships with wholesale vendors and package deals with academic and research institutions. The healthcare manufacturer believes that under stable market conditions, constant production and support of their two flagship products, along with the above assumptions, would maximize the volume of sales. The key question the analytics team needs to resolve is the optimal level of production. That is, how many units of each type of product should the healthcare company plan to manufacture and support (this includes R&D and product support) to maximize the company profit?

Let’s first translate the problem formulation into a mathematical optimization framework using this notation:

  • \(s_1\): the number of CTK units produced annually by the healthcare company, \(s_1\geq 0\)
  • \(s_2\): the number of DataSifter licenses issued/sold each year, \(s_2\geq 0\)
  • \(p_1\): the CTK sale price per unit (\(\$\)),
  • \(p_2\): the DataSifter price per license (\(\$\)),
  • \(C\): the R&D plus manufacturing costs (\(\$\)/year),
  • \(R\): total sales revenue (\(\$\)/year),
  • \(P\) : total profit from all sales (\(\$\)/year).

The provided market estimates result in the following model equations,

\[\begin{array}{lcl} p_1 & = & 339 - 0.01s_1 - 0.003s_2 \\ p_2 & = & 399 - 0.004s_1 - 0.01s_2 \\ R & = & s_1 p_1 + s_2 p_2 \\ C & = & 400,000 + 195s_1 + 225 s_2 \\ P & = & R-C \end{array}\]

By plugging in and expressing \(P\) as a function of \(s_1\) and \(s_2\), the objective cost function (profit) is a nonlinear function of the two product sales \((s_1, s_2)\): \[f =P (s_1, s_2) = -400,000 + 144s_1 + 174s_2 - 0.01s_1^2 - 0.01s_2^2 - 0.007s_1s_2.\]
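We can numerically confirm this algebraic expansion by evaluating both forms of the profit at a few random production levels:

```r
# Confirm that substituting p1, p2, R, C into P = R - C
# yields the expanded quadratic profit above (checked at random points)
set.seed(11)
s1 <- runif(5, 0, 10000); s2 <- runif(5, 0, 10000)

p1 <- 339 - 0.01*s1 - 0.003*s2
p2 <- 399 - 0.004*s1 - 0.01*s2
P_model    <- (s1*p1 + s2*p2) - (400000 + 195*s1 + 225*s2)   # P = R - C
P_expanded <- -400000 + 144*s1 + 174*s2 - 0.01*s1^2 - 0.01*s2^2 - 0.007*s1*s2

max(abs(P_model - P_expanded))   # essentially 0 (floating-point rounding)
```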

10.1.1 Unconstrained Optimization

Let’s assume there are no constraints other than \(s_1, s_2\geq 0\). Solving the unconstrained optimization problem means finding the \(s_1\) and \(s_2\) sales numbers that maximize the profit (\(P\)) in the first quadrant, \(\{(s_1,s_2)\in R^2 | s_i \geq 0\}\). To identify candidate extreme points \(s_1\) and \(s_2\) that maximize \(P\), we set the partial derivatives to zero and solve the following linear system of equations:

\[\begin{array}{lcl} \frac{\partial P}{\partial s_1} & = & 144 - 0.02s_1 - 0.007s_2 & = 0\\ \frac{\partial P}{\partial s_2} & = & 174 - 0.007s_1 - 0.02s_2 & = 0 \end{array}\]

The unique solution is \((s_1^o,\ s_2^o) = (4,735,\ 7,043)\), which yields a maximum profit value \(P^o(s_1^o,s_2^o) =553,641\). As \(s_i^o \geq 0\), the solution is indeed in the feasible region (first planar quadrant). To examine the type of extremum (min or max), let’s inspect the (symmetric) Hessian matrix, \(H_P(s_1^o,\ s_2^o)\), including the second order derivatives of the Profit:
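These critical-point calculations are easy to reproduce in R by solving the \(2\times 2\) linear system:

```r
# Solve the first-order conditions  0.02 s1 + 0.007 s2 = 144,
#                                   0.007 s1 + 0.02 s2 = 174
A <- matrix(c(0.02, 0.007, 0.007, 0.02), nrow = 2, byrow = TRUE)
b <- c(144, 174)
s_opt <- solve(A, b)
round(s_opt)      # (4735, 7043)

P <- function(s) -400000 + 144*s[1] + 174*s[2] -
  0.01*s[1]^2 - 0.01*s[2]^2 - 0.007*s[1]*s[2]
round(P(s_opt))   # maximum profit, about 553641
```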

\[H_P(s_1^o,\ s_2^o) = \begin{bmatrix} \frac{\partial^2 P}{\partial s_1^2} & \frac{\partial^2 P}{\partial s_1 \partial s_2} \\ \frac{\partial^2 P}{\partial s_2 \partial s_1} & \frac{\partial^2 P}{\partial s_2^2} \end{bmatrix} = \begin{bmatrix} -0.02 & -0.007 \\ -0.007 & -0.02 \end{bmatrix}.\]

As the profit objective is a concave quadratic, a sufficient condition for a maximum at \((s_1^o,\ s_2^o) = (4,735,\ 7,043)\) is that the Hessian, \(H_P(s_1^o,\ s_2^o)\), is negative semi-definite, i.e., \(X^T H_P(s_1^o,\ s_2^o) X\leq 0\), for all 2D vectors \(X=(x_1,x_2)^T\):

\[X^T H_P(s_1^o,\ s_2^o) X = \begin{bmatrix} x_1 & x_2 \end{bmatrix} \begin{bmatrix} -0.02 & -0.007 \\ -0.007 & -0.02 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} x_1 & x_2 \end{bmatrix} \begin{bmatrix} -0.02x_1 -0.007x_2 \\ -0.007x_1 -0.02x_2 \end{bmatrix} =\] \[=-0.02x_1^2 -0.014x_1 x_2 -0.02x_2^2\leq 0, \ \forall x_1,x_2 \in \Re,\] since \(0.014|x_1 x_2| \leq 0.007(x_1^2+x_2^2) < 0.02(x_1^2+x_2^2)\).

Let’s display a 3D surface plot of the profit objective function, \(P (s_1, s_2)\). As you move the mouse over the interactive surface, note that profit isolines form closed curves surrounding the maximum at \((s_1^o,\ s_2^o) = (4,735,\ 7,043)\).

grid_length <- 101

Profit_function <- function(x) { # input bivariate x is a matrix n*2
  x <- matrix(x, ncol=2)
  z <- -400000 + 144*x[ ,1] + 174*x[ ,2] - 0.01*x[ ,1]^2 - 0.01*x[ ,2]^2 - 0.007*x[ ,1]*x[ ,2]
  return(z)
}

# define the 2D grid to plot f=P around the optimal value $(s_1^o,\ s_2^o) = (4,735,\ 7,043)$
x <- seq(4700, 4800, length = grid_length)
y <- seq(7000, 7100, length = grid_length)
A <- as.matrix(expand.grid(x, y))
colnames(A) <- c("s1", "s2")
# evaluate function
z <- Profit_function(A)
# put X and y values in a data.frame for plotting
df <- data.frame(A, z)
library(plotly)
x_label <- list(title = "s1")
y_label <- list(title = "s2")
z_label <- list(title = "z=P(s1,s2)")

# Define vertical Line at arg_Max{Profit}
count <- 300
xl <- rep(4735, count)   # fixed (s1, s2) coordinates of the vertical line
yl <- rep(7043, count)
zl <- seq(553541, 553641, length.out=count)
vert_line <- data.frame(xl, yl, zl)

myPlotly <- plot_ly(x=~x, y=~y, z=~matrix(z, grid_length,grid_length), 
                    type="surface", colors = colorRamp(rainbow(8)), 
                    opacity=1.0, hoverinfo="none") %>%
  # add a line at arg_Max{P} = P(x = 4735, y = 7043)=553641
  add_trace(vert_line, x = ~xl, y = ~yl, z = ~zl, type = 'scatter3d', mode = 'lines',
        line = list(width = 4, color = "gray", colorscale = list(c(0,'gray'), c(1,'black')))) %>%
  # add a Ball centered at arg_max
  add_trace(x=4735, y=7043, z=553641, type="scatter3d", mode="markers") %>%
  layout(scene = list(xaxis=x_label, yaxis=y_label, zaxis=z_label), showlegend = FALSE)

myPlotly

10.1.2 Constrained Optimization

Next, we will explore more realistic scenarios where the company has limited resources restricting the number of products that can annually be produced and supported to \(0\leq s_1 \leq 5,000\), \(0\leq s_2 \leq 8,000\), and \(0\leq s_1 + s_2 \leq 10,000\).

Note that the current optimum production plan that maximizes the profit already satisfies the first two constraints (\(s_1^o=4,735 \leq 5,000\) and \(s_2^o = 7,043\leq 8,000\)); however, it violates the last constraint, \(s_1^o + s_2^o = 4,735 + 7,043 =11,778 > 10,000\). Thus, the global Profit maximum is outside the feasible region, and the constrained optimum (max Profit) must lie on the boundary of the convex feasible domain. We will apply non-linear constrained optimization to maximize the Profit:

\[f =P (s_1, s_2) = -400,000 + 144s_1 + 174s_2 - 0.01s_1^2 - 0.01s_2^2 - 0.007s_1s_2.\] subject to

\[constraint(s_1, s_2):\ s_1 + s_2 - 10,000 = 0.\]

10.1.2.1 Manual Solution

Let’s first solve the problem by hand from first principles, translating the primal problem into a dual problem via Lagrange multipliers. The dual (Lagrangian) Profit function is \(P^*(s_1, s_2, \lambda) = P(s_1, s_2)+\lambda\times constraint(s_1, s_2)\). To optimize it, we set \(\nabla P^*=0\). This yields \(\nabla P= -\lambda \nabla constraint\), i.e.,

\[\begin{array}{lcl} \frac{\partial}{\partial s_1}: & 144 -0.02s_1 -0.007s_2 & = & -\lambda \\ \frac{\partial}{\partial s_2}: &174 -0.007s_1 -0.02s_2 & = & -\lambda \end{array}\]

We can eliminate \(\lambda\) to get a single linear relation between \(s_1\) and \(s_2\): \(-0.013s_1+0.013s_2=30\). Pairing this linear equation with the boundary constraint (\(s_1 + s_2 = 10,000\)) yields a system of two linear equations in the two unknowns \((s_1, s_2)\), whose unique solution maximizes the Profit objective function under all given restrictions on the operations of the healthcare company:

\[\begin{array}{lcl} -0.013s_1& +& 0.013s_2 & = & 30 \\ s_1 & + &s_2& = & 10,000 \end{array}\]

Substituting \(s_2\) (or \(s_1\)) from the second constraint equation into the first one yields the (\(\arg\max\)) solution \[(s_1',\ s_2') = (3,846,\ 6,154),\] with a corresponding Profit value \(P'(3,846,\ 6,154) = 532,308\). The latter profit is less than the global profit maximum at \((s_1^o,\ s_2^o) = (4,735,\ 7,043)\). Recall that the global maximum profit was \[P^o(s_1^o,s_2^o)=P^o(4,735,\ 7,043) =553,641.\]
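The manual Lagrange-multiplier solution can be double-checked in R:

```r
# Verify the Lagrange-multiplier solution: pair the stationarity relation
# -0.013 s1 + 0.013 s2 = 30 with the active constraint s1 + s2 = 10000
A <- matrix(c(-0.013, 0.013, 1, 1), nrow = 2, byrow = TRUE)
b <- c(30, 10000)
s_c <- solve(A, b)
round(s_c)      # (3846, 6154)

P <- function(s) -400000 + 144*s[1] + 174*s[2] -
  0.01*s[1]^2 - 0.01*s[2]^2 - 0.007*s[1]*s[2]
round(P(s_c))   # constrained maximum profit, about 532308
```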

10.1.2.2 R-based Automated solution

Next we will solve the constrained optimization problem using Rsolnp::solnp() and confirm that the two approaches yield the same result - maximum profit subject to production limitations for the CTK and the DataSifter healthcare products. Recall the formulation of this quadratic Profit optimization problem and its linear constraint:

\[f =P (s_1, s_2) = -400,000 + 144s_1 + 174s_2 - 0.01s_1^2 - 0.01s_2^2 - 0.007s_1s_2.\] \[\min_{(s_1,s_2)} {\left ( - P (s_1, s_2)\right )} = \max_{(s_1,s_2)} {P (s_1, s_2)},\] subject to \[constraint(s_1, s_2):\ s_1 + s_2 \leq 10,000.\] Note that in the earlier manual solution, we used multivariate calculus to argue that extrema must occur at points where the gradient of the Profit objective vanishes, at points of discontinuity or non-differentiability, or on the boundary. Hence, the constraint there used equality, \(constraint(s_1, s_2):\ s_1 + s_2 = 10,000\). Here, as we are using a generic non-linear optimizer, we specify the actual (inequality) constraint, \(constraint(s_1, s_2):\ s_1 + s_2 \leq 10,000\).

library(Rsolnp)
# define objective profit to min: -P(s1, s2) = -(-400,000 + 144s_1 + 174s_2 - 0.01s_1^2 - 0.01s_2^2 - 0.007s_1 s_2)
Profit_function <- function(x) { # input bivariate x is a matrix n*2
  x <- matrix(x, ncol=2)
  z <- -(-400000 + 144*x[ ,1] + 174*x[ ,2] - 0.01*x[ ,1]^2 - 0.01*x[ ,2]^2 - 0.007*x[ ,1]*x[ ,2])
  return(z)
}
# constraints s1, s2 >=0, and z1: s1 + s2 <= 10,000
ineqn1 <- function(x) { 
  z1=x[1]
  z2=x[2]
  z3=x[1] + x[2]
  return(c(z1, z2, z3))
}

lh <- c(0, 0, 0)                  # x1, x2 and x1 + x2 >=0
uh <- c(10000, 10000, 10000)      # x1, x2 and x1 + x2 <= 10,000

x0 = c(4000, 5000) # setup initial values
sol2 <- solnp(x0, fun = Profit_function, ineqfun = ineqn1, ineqLB=lh, ineqUB=uh)
## 
## Iter: 1 fn: -532307.5079  Pars:  3846.31885 6153.67348
## Iter: 2 fn: -532307.6913  Pars:  3846.29923 6153.70074
## Iter: 3 fn: -532307.6918  Pars:  3846.26334 6153.73664
## solnp--> Completed in 3 iterations
cat("Max Profit = $", -sol2$values[length(sol2$values)], "\n")  # last value is the optimal value
## Max Profit = $ 532307.7
cat ("Location of Profit Max = (s1(#CTK)=", sol2$pars[1], ", s2(#DataSifter)=", sol2$pars[2], ")!\n")  # argmin (s1,s2)
## Location of Profit Max = (s1(#CTK)= 3846.263 , s2(#DataSifter)= 6153.737 )!

10.2 Example 1: Optimization of the Booth’s function

Find the extrema of the Booth’s function \(f(x,y)=(x + 2y -7)^2 + (2x+y - 5)^2\) on the square \(S=\{(x,y)\in R^2 | -10\leq x,y\leq 10\}\).

grid_length <- 101

Booth_function <- function(A) { # input bivariate x is a matrix n*2
  A <- matrix(A, ncol=2)
  z <- ((A[,1] + 2*A[,2] -7)^2 + (2*A[,1]+A[,2] - 5)^2)
  return(z)
}

# define the 2D grid to plot f over
x <- seq(-10, 10, length = grid_length)
y <- seq(-10, 10, length = grid_length)
A <- as.matrix(expand.grid(x, y))
colnames(A) <- c("x", "y")
# evaluate function
z <- Booth_function(A)
# put X and y values in a data.frame for plotting
df <- data.frame(A, z)
library(plotly)
z_label <- list(
  title = "z=f(x,y)"
)

myPlotly <- plot_ly(x=~x, y=~y, z=~matrix(z, grid_length,grid_length), 
                    type="surface", colors = colorRamp(rainbow(8)), 
                    opacity=1.0, hoverinfo="none") %>%
layout(scene = list(zaxis=z_label))

myPlotly

10.3 Example 2: Extrema of the bivariate Goldstein-Price Function

Minimize and maximize the Goldstein-Price function \[f(x,y) \sim \left (1+(x+y+1)^2(19-14x+3x^2-14y+6xy+3y^2)\right )\times \left (30+(2x-3y)^2(18-32x+12x^2+48y-36xy+27y^2)\right ).\] Use Nelder-Mead, Simulated Annealing, and conjugate gradient optimization methods. Then report a table with the extrema points and corresponding functional values.

GP_function <- function(A) { # input bivariate A is a matrix n*2
  A <- matrix(A, ncol=2)
  x <- A[ , 1]
  y <- A[ , 2]
  # calculate the (scaled) function value for each row of A
  z_xy <- ((1+(x+y+1)^2*(19-14*x+3*x^(2)-14*y+6*x*y+3*y^(2))))*
    (30+(2*x-3*y)^2*(18-32*x+12*x^2+48*y-36*x*y+27*y^2))/(10^5)
  # return function value
  return(z_xy)
}

# Nelder-Mead minimization, starting from (1,1); GP_function must be defined first
NM_min <- optim(c(1,1), GP_function, method = "Nelder-Mead")
x <- seq(-5, 5, length = grid_length)
y <- seq(-5, 5, length = grid_length)
# plot the function
A <- as.matrix(expand.grid(x, y))
colnames(A) <- c("x", "y")
# evaluate function
z <- GP_function(A); length(z)
## [1] 10201
df <- data.frame(A, z)
# plot the function

library(plotly)
myPlotly <- plot_ly(x=~x, y=~y, z=~matrix(df$z, grid_length,grid_length), 
                    type="surface", colors = colorRamp(rainbow(8)), opacity=0.9) %>%
  layout(scene = list(zaxis=z_label))
myPlotly

10.4 Example 3: Bivariate Oscillatory Function

Determine the extrema of a complicated oscillatory function \[f(x,y) = -(y+50)\cos\left (\sqrt{\left |y+x+50 \right |}\right ) - x\sin \left ( \sqrt{\left |x-y-50 \right |}\right ).\]

BO_function <- function(A) {
  A <- matrix(A, ncol=2)
  x <- A[ , 1]
  y <- A[ , 2]
  # calculate the function value for each row of x
  z_xy <- -(y+50)*cos(sqrt(abs(y+x+50)))-x*sin(sqrt(abs(x-(y+50))))
  return(z_xy)
}

x <- seq(-200, 200, length = grid_length)
y <- seq(-200, 200, length = grid_length)
A <- as.matrix(expand.grid(x, y))
colnames(A) <- c("x", "y")
# evaluate function
z <- BO_function(A)
# put X and y values in a data.frame for plotting
df <- data.frame(A, z)
library(plotly)
z_label <- list(
  title = "z=f(x,y)"
)
myPlotly <- plot_ly(x=~x, y=~y, z=~matrix(df$z, grid_length,grid_length), 
                    type="surface", colors = colorRamp(rainbow(8)), 
                    opacity=0.8, hoverinfo="none", showscale=FALSE ) %>%
  layout(scene = list(zaxis=z_label), showlegend = FALSE)
myPlotly

10.5 Nonlinear Constraint Optimization Problem

Maximize this objective function (a mixture of polynomial and exponential components): \[f(x,y,z)=x^3 + 5y -2^z\]
subject to \[D = \begin{cases} x -\frac{y}{2}+z^2 \leq 50\\ \mod(x, 4) + \frac{y}{2} \leq 1.5 \end{cases} .\]

Alternatively, minimize \(f^*(x,y,z)=-(x^3 + 5y -2^z)\). It’s difficult to plot a 4D surface in 2D or 3D space, but we can try animating or cross-sectioning the \(w=f(x,y,z)\) surface. The code below can be used, or modified, to generate a dynamic surface or a volume rendering of the objective function, \(f^*(x,y,z)\). This code is not executed here (eval=F) to reduce the size of the output HTML file.

library(plotly)
grid_length <- 101
BO_function <- function(A) {
  A <- matrix(A, ncol=3)
  x <- A[ , 1]
  y <- A[ , 2]
  z <- A[ , 3]
  # calculate the function value for each row of x
  w_xyz <- -(x^3 + 5*y -2^z)
  return(w_xyz)
}

x <- seq(-50, 50, length = grid_length)
y <- seq(-50, 50, length = grid_length)
z <- seq(-50, 50, length = grid_length)
A <- as.matrix(expand.grid(x, y, z))
colnames(A) <- c("x", "y", "z")

# evaluate function
w <- log(abs(BO_function(A)))   # apply log(abs()) to tame the large w values
# put x, y and z values in a data.frame for plotting
df <- data.frame(A, w); str(df)
# Show that the histogram has a lower bound (corresponding to a minimum)
# hist(df$w, xlim=c(-20000000000, 10000000000), freq=T, breaks = 10, 
#     main="Histogram of f has a lower bound\n (corresponding to a minimum)")

z_label <- list(title = "w=f(x,y_o,z)")
myPlotly <- plot_ly(df, x=~x, y=~y, z=~matrix(df$w, grid_length,grid_length,grid_length), frame = ~df$z,
                    type="surface", colors = colorRamp(rainbow(8)), 
                    opacity=0.8, hoverinfo="none", showscale=T ) %>%
  layout(scene = list(zaxis=z_label), showlegend = FALSE)  %>%
  animation_opts(500, easing = "elastic", redraw = F) %>% 
  animation_slider(active = 50, currentvalue = list(prefix = "Z ", font = list(color="red")))
  
myPlotly

# 3D Volume rendering
library(brainR)
vol_surface <- array(df$w, c(grid_length,grid_length,grid_length)); dim(vol_surface)
contour3d(vol_surface, level = c(-10,-8,0), color =c("red", "green", "blue"), alpha = 0.1, draw = TRUE)

Next, we can use solnp to optimize the objective function.

fun3 <- function (x){
  -(x[1]^3 + 5*x[2] - 2^x[3])
}

# constraint z1: x - y/2 + z^2 <= 50
# constraint z2: mod(x, 4) + y/2 <= 1.5
# constraint z3: z (no explicit bound imposed here)
ineq3 <- function(x){
  z1 = x[1] - (x[2]/2) + (x[3]^2)
  z2 = (x[1] %% 4) + (x[2]/2)
  z3 = x[3]
  return(c(z1, z2, z3))
}

lb = c(-1000,-1000,-1000)
ub = c(1000,1000,100)
lh <- c(-Inf, -Inf, -Inf) # tested lower bound and it doesn't change with c = -100 and c = -1000
uh <- c(50, 1.5, Inf)

x0 <- c(1,0,1) # setup: Try alternative starting-parameter vector (pars)
sol3 <- solnp(x0, fun=fun3, ineqfun=ineq3, ineqLB=lh, ineqUB=uh)   #, LB=lb, UB=ub)
## 
## Iter: 1 fn: -11.2821  Pars:  0.000002152 2.817591348 1.488463477
## Iter: 2 fn: -11.2821  Pars:  0.000002152 2.817591348 1.488463477
## solnp--> Completed in 2 iterations
cat("Min is attained at: (", sol3$pars[1], ", ", sol3$pars[2], ", ", sol3$pars[3], ")\n")
## Min is attained at: ( 2.152334e-06 ,  2.817591 ,  1.488463 )
cat("Min_f = ", sol3$values[length(sol3$values)])
## Min_f =  -11.28206

Finally, we can also try the optim() method using simulated annealing, a slower stochastic global optimizer that works well with difficult functions, e.g., non-differentiable or non-convex ones.

# Define the constraints with conditions returning NA to restrict SANN stochastic walk
fun3_constraints <- function(x) {
  if (x[1] - (x[2]/2)+x[3]^2 > 50) {NA}       # constraint 1
  else if ((x[1] %% 4) + (x[2]/2) > 1.5) {NA} # constraint 2
  else { fun3(x) }                            # the objective function, 
}

# Initial conditions, chosen analytically
x0 <- c(45, 0, 0)

# Initialize optimization using dummy variables
x_optimal_value <- Inf
x_optimal_point <- c(NA, NA)

# Run 50 iterations of SANN and choose the optimal result
for (i in 1:50) {
  x_optimal <- optim(x0, fun3_constraints, method="SANN")
  if (x_optimal$value < x_optimal_value) {
    x_optimal_value <- x_optimal$value
    x_optimal_point <- x_optimal$par
  }
}

cat("fun3 minimum estimate is=", x_optimal_value, " which is attained at (", x_optimal_point[1], ", ", x_optimal_point[2], ")!\n")
## fun3 minimum estimate is= -123108.9  which is attained at ( 49.74704 ,  -0.5046622 )!

Check your solution against the Wolfram Alpha solution: \[\min_{D} f(x,y,z)=f(x=0, y=3, z=-3.6)=-15.\]

The Convex Optimization in R paper provides many additional examples, and Wikipedia provides a number of interesting optimization test functions.

11 References
